This is the final report for NASA grant NAG 5-3, which is concerned with
high-level language, computer architecture and algorithms for the Massively
Parallel Processor (MPP). Previous work for this grant is described in Purdue
technical reports TR-EE 80-32 and TR-EE 81-45. In this report only the
recent updates to previous work and results of new work are given; previous
work which is not described in detail here is mentioned in section one.

The main effort of the research has been to design a high level language
for the MPP. This language, called Parallel Pascal, is described in detail
in this report. Report sections include a description of the language design,
a description of the intermediate language, Parallel P-Code, and details for
the MPP implementation. Appendices give formal descriptions of Parallel
Pascal and Parallel P-Code. A compiler has been developed which converts
programs in Parallel Pascal into the intermediate Parallel P-Code language;
the code generator to complete the compiler for the MPP is being developed
independently by CSC for NASA. A Parallel Pascal to Pascal translator has
also been developed. This allows Parallel Pascal programs to be developed
and run on conventional computers without the need for direct access to the
MPP.

In related work the architecture design for a VLSI version of the MPP
is completed with a description of fault tolerant interconnection networks.

In another section the memory arrangement aspects of the MPP are discussed
and a survey of other high level languages is given in an appendix.

1. Introduction

This is the third and final report for grant NAG 5-3. The other two
interim reports, TR-EE 80-32 [1] and TR-EE 81-45 [2], contain much information
which is only briefly summarized in this report. The reader is referred
to these reports for a complete detailed description of the work conducted
for this grant.

The main concentration of the research effort has been directed towards
the development of a high level language for parallel processors in general
and the MPP in particular. Such a language, called Parallel Pascal, has been
developed and is described in detail in this report.

In addition to this work, we have also conducted research into advanced
architectures for parallel processors such as the MPP and have programmed
some algorithms for the MPP. In the remainder of this introductory section
the work in languages, architectures and algorithms is briefly summarized
and then an outline for the remainder of this report is given.

1.1 High Level Languages

There is an obvious need for the availability of a high level language
for programming parallel processors such as the MPP. In our research in
this area we have considered languages based on APL, Fortran and Pascal.

The majority of the research has been devoted to the development of a language
called Parallel Pascal. A specification for a parallel APL is given in [2],
section 4 and the specification for a Parallel Fortran, which is relatively
simple to implement, is given in [2], section 3. In Appendix 6 a general
discussion is presented on the high level languages which have been developed
for other parallel processors. This discussion includes their relevance and
shortcomings with respect to parallel matrix processors including the MPP.

In this report section 2 contains the recent developments to the language
since reports [1] and [2]. The I/O section of the language is specified here
and implementation restrictions on the compiler for the MPP are described. Methods
of programming the I/O for large size images with the implemented
language are also outlined in this section. In section 3, the design of
Parallel Pascal is presented. This section starts with a discussion of the
design goals of the language and continues to introduce the features which
have been added to conventional Pascal in a logical, step by step manner.

In Appendix A a formal specification of the Parallel Pascal language is given
including a complete grammar.

A large part of the research effort was directed towards the specification
of an intermediate compiler language called Parallel P-Code which was developed
from the P-Code intermediate language used in many Pascal compilers. The
design of this language is presented in section 4 and a more formal language
specification is given in Appendix B. This language may be used for the
compilers of languages other than Parallel Pascal such as Parallel Fortran.

A compiler which compiles Parallel Pascal into this intermediate language
has been developed as part of the work for this grant NAG 5-3.

A preliminary description of Parallel P-Code was given in [2] section
2.4; extensive revisions have been made to the language since then. These
revisions were caused by the complexity of the new language features and the
different memory systems which the data may be mapped onto. The resulting
language is at a higher, more symbolic level than conventional P-Code, which
gives the code generator more flexibility for optimization and allocation
of memory.

A Parallel Pascal translator has also been developed which is described
in [2] sections 2.2 and 2.3. This translator allows programs written in
Parallel Pascal to be compiled and run on conventional computers which have
a Pascal compiler. This is a very important tool which enables Parallel Pascal
programs to be developed, debugged and tested without the MPP. Consequently,
programs may be developed on the user's local computer at the user's convenience
even before the MPP hardware is available.


1.2 Computer Architecture

We have considered several architecture alternatives and extensions to
the basic MPP design. The first feature which we considered important is a
hardware bit-counting mechanism which can rapidly count the number of bits in
a bit-plane. This mechanism was considered to be important for algorithms
involving global feature extraction. A hardware bit counter design is
presented in [3] where it is shown that a very large speed improvement over
current MPP bit-counting methods can be achieved at a small cost. Algorithms
where bit counting is important are also discussed in [3]. In reference [4],
which is also in Appendix B of [2], algorithms are described for real-time
image tracking and it is shown that the MPP with the bit counting hardware
could implement these algorithms in real-time.
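
As a small illustration of why a fast bit count supports global feature
extraction, the following Python sketch counts the set bits in each bit-plane
of a tiny image and recovers the sum of all pixels from those counts alone.
It is not the hardware design of [3]; the names bit_plane, count_bits and
image_sum_from_counts are hypothetical, and the image is an ordinary nested
list rather than MPP memory.

```python
def bit_plane(image, k):
    """Return bit k of every pixel as a plane of 0/1 values."""
    return [[(pixel >> k) & 1 for pixel in row] for row in image]


def count_bits(plane):
    """Count the 1 bits in a bit-plane (the step a hardware counter would speed up)."""
    return sum(sum(row) for row in plane)


def image_sum_from_counts(image, n_bits=8):
    """Sum of all pixels, computed only from the per-plane bit counts."""
    return sum(count_bits(bit_plane(image, k)) << k for k in range(n_bits))


# Tiny stand-in for a 128 x 128 image of 8-bit pixels.
image = [[0, 1, 2], [3, 4, 255]]
plane_counts = [count_bits(bit_plane(image, k)) for k in range(8)]
total = image_sum_from_counts(image)
```

Each bit count is weighted by the value of its plane, so one count per
bit-plane is enough to obtain a global sum over the whole array.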

The construction of an MPP-like array using VLSI technology components
has been considered in [1], section 4. A three chip set is proposed consisting
of a dense PE ALU chip, a local memory chip and a fault tolerant interconnection
chip. The ALU chip is designed for optimal bit-serial multiplication speed
which is much faster than the MPP design; it also has a table-look-up capability
which is not available on the MPP. An extended interconnection scheme, called
the two-dimensional perfect shuffle, is considered which overcomes most of the
problems of the mesh interconnection scheme used on the MPP, but the
implementation cost is very high.

In this report, section 6, a more detailed design of the fault tolerant
interconnection chip is presented. It is shown that a very large amount of
fault tolerance in the mesh connected array can be achieved at a very reasonable
cost. This means that the construction of arrays much larger than the current
MPP size of 128x128 can be considered in the future. A second possibility
for future consideration is to implement the whole PE array on a single
silicon slice consisting of many interconnected chips. Reconfiguration to
avoid bad parts of the chips can be done in software once the slice has been
completely fabricated. Furthermore, if additional faults occur in the PE
array at a later time, then software reconfiguration can be used again to
avoid these new faults.


1.3 Algorithms

The design of the high level language and the computer architecture should
both be algorithm driven. We have considered typical algorithm implementations
at both high and low levels.

The MPP does not have a table-look-up facility. In [5] and also in [1]
section 3, an efficient mechanism for implementing arbitrary functions on a
bit-serial computer architecture is described and a function compiler for the
MPP has been developed. The specification of the function is input to this
function compiler which generates an optimized subroutine in bit-level assembler
code for implementing the function. Any arbitrary function may be implemented;
however, the number of instructions generated by the compiler increases
exponentially with the number of bits of the function arguments. This technique
is most suitable for arguments of 8 bits and less and for functions which cannot
be implemented by a very simple direct method.
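
The exponential growth can be seen in a deliberately naive Python sketch that
evaluates an arbitrary function using only bit tests, ANDs and ORs, one
minterm per argument value. This is not the optimized method of [5] or the
output of the function compiler; it is an unoptimized sum-of-minterms
illustration, and the names minterm_table, evaluate and square are
hypothetical.

```python
def minterm_table(f, n_bits, out_bits):
    """For each output bit, list the argument values whose result has that bit set."""
    return [[v for v in range(1 << n_bits) if (f(v) >> k) & 1]
            for k in range(out_bits)]


def evaluate(table, x, n_bits):
    """Evaluate the tabulated function on x using only bit comparisons, AND and OR."""
    result = 0
    for k, minterms in enumerate(table):
        bit = 0
        for v in minterms:
            match = 1
            for j in range(n_bits):
                # One bit-level test per argument bit, as on a bit-serial PE.
                match &= int(((x >> j) & 1) == ((v >> j) & 1))
            bit |= match
        result |= bit << k
    return result


def square(v):
    """Example function: square of a 4-bit argument, kept to 8 result bits."""
    return (v * v) & 0xFF


table = minterm_table(square, 4, 8)
candidates = 1 << 4   # number of argument values that must be considered
```

Every additional argument bit doubles the number of candidate minterms, which
is consistent with the restriction to arguments of 8 bits or less.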

Various algorithms have been programmed in Parallel Pascal and are
presented in [1] and [2]. In [1] section 2.3, programs are given for PE address
generation, image rotation, bilinear image resampling, maximum likelihood
classification, convolution, histogram generation and Isodata clustering. In [2]
section 2.2.5, programs are given for manipulating large arrays on the 128x128
PE array of the MPP.

In reference [2], some basic parallel algorithms are discussed. Algorithms
for local window operations including local sorting and local median filtering
are described in section 5 of [2]; these algorithms have also been published
[4]. In Appendix B of [2], algorithms for rapid sequential frame registration,
enhancement and feature extraction are described.

In this report sections 2 and 3 and Appendix A deal with the final
specification of the Parallel Pascal language. Section 4 and Appendix B deal with
the intermediate Parallel P-Code language. Section 5 is concerned with memory
management issues, i.e. methods to overcome the limitations caused by the small
local memory size on the MPP. Finally, section 6 deals with the design of a
VLSI interconnection chip which completes the design for a future VLSI MPP-like
architecture. The majority of the effort during this last period in grant
NAG 5-3 has been directed towards finalizing the design of Parallel Pascal and
fully considering the constraints of the MPP implementation. The work on the
Parallel Pascal compiler which generates the Parallel P-Code has been completed.

2. Parallel Pascal Updates

The current specification of Parallel Pascal is very similar to the
specification given in TR-EE 81-45. The main changes, made to control structures,
array indexing and I/O, are outlined below in section 2.1. A complete specification
is given in Appendix A.

The restrictions which will be made to the initial MPP compiler have been
determined and these are described in section 2.2. Several examples of performing
I/O on the MPP are outlined in section 2.3.

2.1 Revisions and I/O Specification

The only control construct which can have an array control variable is the
where-do-otherwise construct. This is similar to the if-then-else construct with
the following differences:

1. The control expression may have an array data type.

2. All the targets of assignments must be conformable with the control
expression, i.e., they must be either a similar sized array or a scalar.

3. Both the do and the otherwise sections will be executed in sequence;
the do first and then the otherwise with the complement of the condition.

The where structure involves conditional assignment rather than conditional
evaluation. Where structures may be nested with other where structures and with
other conventional control structures.
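
The three differences listed above can be sketched in Python as a masked
assignment over flat lists. This is only an illustration of the semantics
described in this section, not the compiler's implementation; the helper
names broadcast and where_do_otherwise are hypothetical, and nested lists,
scalar targets and nested where structures are left out for brevity.

```python
def broadcast(value, n):
    """Return value as a list of length n; a scalar is replicated n times."""
    if isinstance(value, list):
        if len(value) != n:
            raise ValueError("operand is not conformable with the control expression")
        return value
    return [value] * n


def where_do_otherwise(mask, target, do_value, otherwise_value=None):
    """Assign do_value where mask holds, then otherwise_value where it does not."""
    n = len(mask)
    if len(target) != n:
        raise ValueError("target is not conformable with the control expression")
    # The 'do' section runs first, under the control mask.
    do_values = broadcast(do_value, n)
    for i in range(n):
        if mask[i]:
            target[i] = do_values[i]
    # The 'otherwise' section runs second, under the complemented mask.
    if otherwise_value is not None:
        other_values = broadcast(otherwise_value, n)
        for i in range(n):
            if not mask[i]:
                target[i] = other_values[i]
    return target


a = [3, -1, 4, -5]
negative = [x < 0 for x in a]
doubled = [x * 2 for x in a]          # evaluated for every element
where_do_otherwise(negative, a, 0, doubled)
```

Note that the expression for the otherwise section is evaluated for every
element and only its assignment is masked, which is the sense in which the
construct is conditional assignment rather than conditional evaluation.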

In report TR-EE 81-45 the concepts of subrange constant and subrange indexing
were introduced. For example, the expression a[11..20] specifies a subvector of a
consisting of elements a[11] through a[20]. This feature has now been extended to
include an offset expression. For example, a[5 @ 11..20] adds the offset 5 to
the constant subrange and results in the elements a[16] through a[25]. This offset
has a similar effect to the shift function but is notationally more convenient
in some cases and may be used on the left side of an assignment. The offset
may be specified by an expression whereas the subrange must be a constant. The
syntax given for subrange indexing has been changed slightly since the
grammar using the old syntax would no longer be of type LL(1), which is a
requirement for the simple parsing of Pascal.
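
The effect of the offset on a constant subrange can be sketched in Python
with list slices. The functions subrange and assign_subrange and the
first_index parameter are hypothetical names introduced for this
illustration; Parallel Pascal expresses the same selection directly in the
index expression.

```python
def _bounds(a, low, high, offset, first_index):
    """Translate an offset subrange into Python slice bounds, with range checking."""
    start = offset + low - first_index
    stop = offset + high - first_index + 1
    if start < 0 or stop > len(a) or start >= stop:
        raise IndexError("subrange lies outside the declared index range")
    return start, stop


def subrange(a, low, high, offset=0, first_index=0):
    """Elements a[offset+low] through a[offset+high], inclusive."""
    start, stop = _bounds(a, low, high, offset, first_index)
    return a[start:stop]


def assign_subrange(a, low, high, values, offset=0, first_index=0):
    """Use an offset subrange on the left side of an assignment."""
    start, stop = _bounds(a, low, high, offset, first_index)
    if len(values) != stop - start:
        raise ValueError("value is not conformable with the subrange")
    a[start:stop] = values


a = [100 + i for i in range(30)]            # a[0..29], with a[i] = 100 + i
plain = subrange(a, 11, 20)                 # a[11] through a[20]
shifted = subrange(a, 11, 20, offset=5)     # a[16] through a[25]
assign_subrange(a, 11, 20, [0] * 10, offset=5)
```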

The I/O specification for Parallel Pascal will be the same as that for
conventional Pascal. Parallel array I/O will be done with files declared to have
parallel arrays as basic elements. Special techniques for dealing with very
large arrays and for reformatting array data are outlined in section 2.3.

The names of the standard reduction functions have been changed; however,
these functions are still defined in the same way. The old and new names
for these functions are given below:


Old Name     New Name     Function

asum         sum          sum
aprod        prod         product
aand         all          AND
aor          any          OR
amax         max          maximum
amin         min          minimum
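
What each reduction function computes can be sketched in Python by folding a
binary operator over all the elements of an array. This sketch reduces over
the whole array only; selection of particular dimensions is not shown. The
names flatten, OPERATORS and reduce_array are hypothetical.

```python
from functools import reduce
import operator


def flatten(a):
    """Yield the scalar elements of an arbitrarily nested list."""
    for x in a:
        if isinstance(x, list):
            yield from flatten(x)
        else:
            yield x


# New reduction names from the table, each mapped to its binary operator.
OPERATORS = {
    "sum": operator.add,
    "prod": operator.mul,
    "all": lambda x, y: x and y,
    "any": lambda x, y: x or y,
    "max": max,
    "min": min,
}


def reduce_array(name, a):
    """Reduce every element of array a with the named reduction function."""
    return reduce(OPERATORS[name], flatten(a))


image = [[1, 2], [3, 4]]
flags = [[True, False], [True, True]]
results = {name: reduce_array(name, image) for name in ("sum", "prod", "max", "min")}
```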


2.2 The MPP Parallel Pascal

The initial Parallel Pascal to be implemented on the MPP will have several
basic restrictions. Some of these restrictions may be removed with subsequent
versions of the compiler.

The most fundamental restriction is that the last two dimensions of any
parallel array must be 128 x 128 (or the last dimension must be 16384). It was
decided that not hiding the machine architecture from the programmer in this
way was necessary, at least in the early programming of the MPP, to ensure that
well structured efficient programs are developed. The local memory is very
limited on the MPP and the processing efficiency is greatly reduced if arrays
are used which are not multiples of 128 x 128. For effective programming the user
must be aware of these characteristics; in some cases different array sizes may
dictate different programming strategies to efficiently implement the same function.
A future compiler may contain some built-in strategies for arbitrary sized arrays,
but these will not be optimal for all cases. Techniques for dealing with large
arrays in MPP Parallel Pascal are discussed in section 2.3.

Any arrays having the last two dimensions other than 128 x 128 will be stored
in the staging buffer. Only subarrays having the last two dimensions 128 x 128 can
be directly processed; smaller subarrays may be "read" or "written" by assignment
statements.

A second restriction is that parallel arrays cannot contain pointers or
records. It is possible for a parallel array to contain records without variant
parts; however, a record of parallel arrays is probably a better data structure
to use in this case. Pointers are not allowed since they would, in general, point
to a different memory system and would be very difficult to manage.

Finally, there may be some minor restrictions to procedures or program
blocks which contain parallel expressions. These will be a feature of the code
generator which may be removed at a later date and are outside the scope of this
report.

2.3 Large Array Processing

In this section the processing of arrays larger than 128 x 128 is considered.
Also a mechanism for using the reformatting features of the staging buffer is
described. To aid clarity all declarations of parallel arrays which are to be
located in the staging buffer rather than the PE array unit will be specified by
buffer array rather than parallel array. Only single band arrays will be described;
however, multispectral data may be easily accommodated by defining the arrays to
have one more dimension.

Large arrays will be considered to be of two types: (a) arrays which will
fit in the staging buffer and (b) arrays which are too large for the staging buffer.
Type (a) arrays are considered first.

2.3.1 Whole arrays in the MPP

For arrays with the last dimensions being 128 x 128 the following program
example is typical for reading an array.

Type
   pixel = 0..255;
   MPPA = parallel array [0..127,0..127] of pixel;

Var
   f: file of MPPA;
   a: MPPA;

Begin
   reset (f);
   read (f,a);


For arrays larger than 128 x 128 that are to be accessed in 128 x 128 chunks
the following scheme may be used (for a 384 x 640 array in this case).

Type
   pixel = 0..255;
   MPPA = parallel array [0..127,0..127] of pixel;
   BUFA = buffer array [0..383,0..639] of pixel;

Var
   bf: file of BUFA;
   a: MPPA;
   b: BUFA;

Begin
   reset (bf);
   read (bf,b);
   a := b[0..127,256 @ 0..127];

A second alternative, if the array is to be processed as a single entity in
the MPP (as outlined in section 2.5 of TR-EE 81-45), is to consider the array
to have dimensions 3 x 5 x 128 x 128 as shown below.

Type
   pixel = 0..255;
   MPPL = parallel array [1..3,1..5,0..127,0..127] of pixel;

Var
   fc: file of MPPL;
   c: MPPL;

Begin
   reset (fc);
   read (fc,c);
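
Both schemes can be mimicked on a conventional computer with nested lists.
The Python sketch below extracts one 128 x 128 chunk from a 384 x 640 array
and also views the same array as a 3 x 5 grid of chunks. The function names
chunk and as_chunk_grid are hypothetical, and the correspondence
grid[i][j][r][c] = b[128*i + r][128*j + c] is an assumption made for this
illustration; the report does not state how the elements of the
3 x 5 x 128 x 128 array are ordered in the file.

```python
CHUNK = 128


def chunk(b, row_offset, col_offset, size=CHUNK):
    """Return the size x size subarray of b whose top-left corner is at the given offsets."""
    return [row[col_offset:col_offset + size]
            for row in b[row_offset:row_offset + size]]


def as_chunk_grid(b, size=CHUNK):
    """View a large array as a grid of size x size chunks (assumed row-major by chunk)."""
    rows, cols = len(b), len(b[0])
    if rows % size or cols % size:
        raise ValueError("array dimensions must be multiples of the chunk size")
    return [[chunk(b, r, c, size) for c in range(0, cols, size)]
            for r in range(0, rows, size)]


# A 384 x 640 buffer array in which each pixel records its own position.
b = [[r * 640 + c for c in range(640)] for r in range(384)]
a = chunk(b, 0, 256)          # rows 0..127, columns 256..383
grid = as_chunk_grid(b)       # 3 x 5 chunks of 128 x 128
```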


2.3.2 Partial arrays in the MPP

When the array is too large to fit into the MPP then it must be stored on
the disk in convenient sized chunks. The formatting of the data on the disk could
and should be done by a back end processor since the processing requirement of this
task is very low and would be a waste of time on the MPP. Alternatively the
reformatting could be done by the MPP at the beginning of the program.

The data, therefore, is in the form of a file of chunks; for simplicity we
will consider each chunk to have the last dimensions 128 x 128 although they could
also be a multiple of 128. Random access of these chunks is possible but the
seek time of conventional disk systems will make this scheme very slow. A "seek"
function in Pascal would be very useful for random file access and could easily be
implemented in the Parallel Pascal compiler. A faster mode of operation, which
is adequate for most applications, is to spool the data through the MPP. In this
case the sequence of disk accesses is known and the data may be arranged on the
disk to minimize seek time.

The spooling system could be written as a set of Parallel Pascal library
functions. The following functions would be required: resets, open a spooling file;
reads, read the next block; writes, write the next block; and closes, close the
spool file. Each reads and writes operation will access the next sequential block
of the large array file. Four modes of data access have been considered and are
illustrated in Fig. 2.3.1. Each mode is useful for a particular class of algorithm.

In mode zero each block is overlapped with the previous one by a specified
amount. In this way some edge effects, caused by sequential block accessing, can
be ignored.

In the simple mode, mode one, there is no overlap between blocks. This is
the simplest mode to implement and is adequate for point operations but edge
effects may cause problems if near neighbor information is used.

The three near neighbor mode, mode two, provides an alternative method for
near neighbor processing, especially for large window operations. The near
neighbor chunks provide sufficient edge information for rotation and geometric
distortion problems also.

The eight near neighbor mode, mode three, is useful when not enough near
neighbor information can be obtained with mode two.

The near neighbor accessing modes (zero, two and three) must use a large
part of the staging buffer to minimize the number of disk accesses. When possible,
several rows of chunks will be kept there. Pointers will be used extensively in
the reads and writes procedures to minimize the number of data transfers.
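
The difference between the simple mode and the overlapped mode can be
sketched in Python along one array dimension. The generator spool_blocks is
a hypothetical name, and the one-dimensional overlap is an assumption made
for illustration; the exact overlap geometry is defined by Fig. 2.3.1, and
the two near neighbor modes are not modelled here.

```python
def spool_blocks(n_rows, block, overlap=0):
    """Yield (first_row, end_row) for each sequential block; end_row is exclusive.

    overlap == 0 corresponds to the simple mode (mode one); overlap > 0
    corresponds to the overlapped mode (mode zero) along this one dimension.
    """
    if not 0 <= overlap < block:
        raise ValueError("overlap must be non-negative and smaller than the block size")
    step = block - overlap
    start = 0
    while True:
        stop = min(start + block, n_rows)
        yield (start, stop)
        if stop >= n_rows:
            break
        start += step


simple = list(spool_blocks(384, 128))                 # mode one
overlapped = list(spool_blocks(384, 128, overlap=16)) # mode zero, 16-row overlap
```

With an overlap the final block may be shorter than the others; a real
spooler would have to pad it or adjust the block positions.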

2.3.3 Data Reformatting

The MPP staging buffer has the capability of reformatting data flowing through
it in a large number of ways. This feature has not been explicitly used in
Parallel Pascal, although it is used implicitly when reading 128 x 128 chunks.

[Figure: table of mode number and name (0 overlapped, 1 simple, 2 three near
neighbor, 3 eight near neighbor) with a diagram of the data accessed in each mode]

Fig. 2.3.1 Spooler data accessing modes

Explicit use of the reformatting feature is useful in two main applications:
(a) when the data file is in an unusual format and (b) when the data in the PE
array is to be redistributed in the array for more efficient processing.

These functions should be treated separately. File reformatting specifies
the program's view of the outside world and has no impact on the algorithm to be
implemented; for this reason the format should be declared once and then
conventional read and write statements may be used. On the other hand, internal data
redistribution is a part of the algorithm to be implemented and this should be
made explicitly clear. Methods of implementing data reformatting in Parallel
Pascal are outlined below. On the MPP implementation the new functions would
be built into the compiler.

File reformatting will be achieved by a procedure called reformat which
takes a file identifier as an argument and regenerates the mapping
parameters for the staging buffer. Other parameters to reformat specify the
new permutation. The final parameter format for reformat has not been
specified; however, a typical example using a provisional format is outlined below.

Consider that a multispectral image is organized in 128 x 128 chunks of
6 bands which are interleaved and the standard Parallel Pascal I/O usually
expects the data in a non-interleaved form.

The reformatting of the data can be achieved by a single call to reformat
as shown in the following program segment.

Type
   MPPM = parallel array [1..6,0..127,0..127] of 0..255;

Var
   f: file of array [0..127,0..127,1..6] of 0..255;
   a: MPPM;

Begin
   reformat (f,2,3,1);
   reset (f);
   read (f,a);

The file declaration specifies the external format of the data while the
reformat parameters specify the dimension reordering to achieve the correct internal
format.
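
The dimension reordering in this example, from band-interleaved external data
to a band-major internal array, can be sketched in Python as follows. The
function deinterleave is a hypothetical name, and the sketch makes no claim
about how the provisional reformat parameters encode the permutation; it
shows only the data movement that the staging buffer would perform.

```python
def deinterleave(external):
    """Reorder external[row][col][band] into internal[band][row][col]."""
    rows = len(external)
    cols = len(external[0])
    bands = len(external[0][0])
    return [[[external[r][c][k] for c in range(cols)]
             for r in range(rows)]
            for k in range(bands)]


# Small stand-in for a 128 x 128 chunk of 6 interleaved bands:
# each value records its own (row, col, band) position.
external = [[[100 * r + 10 * c + k for k in range(6)]
             for c in range(3)]
            for r in range(2)]
internal = deinterleave(external)
```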

Reformatting data in the PE array is achieved with three procedures:
resetf, readf, and writef. The first parameter to resetf is a virtual
channel number (an integer) which has a similar function to a file identifier
except that there is no disk file associated with it. The remaining parameters
of resetf completely specify the data transformation to be made.

Readf and writef are similar to read and write except that the virtual
channel number replaces the file identifier. Resetf initializes the virtual channel;
then readf and writef can be used as if it were a conventional Pascal file.
Sufficient data must be written by writef calls before readf is used.

For example, suppose that we want to combine two 128 x 128 arrays to form a
2 x 128 x 128 array, and the elements are to be shuffled together by the staging
buffer. The following program segment illustrates how this might be organized.

Const
   vchan = 1;

Type
   MPPA = parallel array [0..127,0..127] of 0..255;
   MPPB = array [1..2] of MPPA;

Var
   a,b: MPPA;
   p: MPPB;

Begin
   resetf (vchan, <permutation parameters>);
   writef (vchan, a);
   writef (vchan, b);
   readf (vchan, p);



The virtual channel, vchan, is still open and may be used for subsequent
data permutations. In general, the data transfers are not restricted to having
the last dimensions 128 x 128 since the data item being transferred may have
subrange indices.

For the more restrictive, but useful, case where a single array is to be
permuted then a single function, called map, may be appropriate. Map is used
in the following way:

   a := map (b, <permutation parameters>);

where a and b are arrays with the same number of data elements. Map may be
programmed in Parallel Pascal if the functions resetf, writef and readf are
available.
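
The behaviour of a virtual channel, and the way map can be built from it, can
be sketched in Python. Everything here is an assumption made for
illustration: the class VirtualChannel, the representation of the permutation
parameters as a Python function, the perfect_shuffle permutation and the use
of flat lists are hypothetical, since the report leaves the permutation
parameters unspecified.

```python
class VirtualChannel:
    """A buffer with no disk file behind it; data is permuted when read back."""

    def __init__(self, permutation):
        # Plays the role of resetf: permutation(i, n) gives the input
        # position that supplies output position i of an n-element read.
        self.permutation = permutation
        self.buffer = []

    def writef(self, data):
        """Append a flat array to the channel."""
        self.buffer.extend(data)

    def readf(self, count):
        """Remove count elements from the channel and return them permuted."""
        if len(self.buffer) < count:
            raise ValueError("insufficient data has been written to the channel")
        taken, self.buffer = self.buffer[:count], self.buffer[count:]
        return [taken[self.permutation(i, count)] for i in range(count)]


def perfect_shuffle(i, n):
    """Interleave the first and second halves of the written data."""
    return i // 2 + (i % 2) * (n // 2)


def map_array(b, permutation):
    """The single-array case: write one array, read it back permuted."""
    channel = VirtualChannel(permutation)
    channel.writef(b)
    return channel.readf(len(b))


vchan = VirtualChannel(perfect_shuffle)
vchan.writef([1, 2, 3, 4])          # array a
vchan.writef([10, 20, 30, 40])      # array b
p = vchan.readf(8)                  # a and b shuffled together
reversed_b = map_array([1, 2, 3, 4], lambda i, n: n - 1 - i)
```

After the read the channel is empty but still usable, which corresponds to
the remark that vchan remains open for subsequent permutations.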


3. PARALLEL PASCAL DESIGN


3.1 Motivation

Parallel Pascal is a high-level matrix language designed for
the user of a parallel matrix processor. The decision to design
a new language reflects a degree of dissatisfaction with
facilities available in existing languages. For some reason (or
combination of reasons) no single existing language was judged to
be suitable for parallel matrix processors.

To judge the suitability of a language, it is necessary to
consider the functions it is to serve. Wulf [1] defines three
goals of a programming language: it is a design tool, a vehicle
for human communication, and a vehicle for instructing a
computer. The language which is chosen for a particular
application should be one which satisfies all of these criteria.

Programming can be considered to be the act of mapping a
problem into machine code [2]. This mapping occurs at two levels.
The original problem is translated by a human into a program in
some language, and then this program is translated by a compiler
(or assembler) into machine code. Each translation involves the
loss of information - the program contains less information than
the problem and the machine code contains less information than
the program. Unfortunately, these relationships are often dual
in nature - a language which facilitates programming by humans
(and communication among them) will often be more difficult to
compile into machine code.

In recent years, a great deal of emphasis has been placed
upon the use of a set of techniques collectively referred to as
"structured programming". (The opinions on this subject have
been by no means unanimous; a discussion of the merits and harms
of a number of the "tenets" of structured programming is given
in reference 3 among many others.) These techniques encourage
careful, regular, modular designs, thereby facilitating the
construction of programs which are highly reliable and
maintainable.

Taking the above factors into consideration, a "good"
language is one which facilitates communication among humans and
between humans and machines, one which permits expression of a
problem without undue loss of information, one which can be
compiled into reasonably efficient machine code, and one which
encourages structured programming techniques.

Having determined what factors are necessary for a "good"
language, the development of Parallel Pascal can now be
considered.

3.2 Initial Specification

3.2.1 Design Goals

Since none of the available languages were entirely suitable
for implementation on a parallel matrix processor, the design of
a new language was undertaken. The design goals of this new
language (which eventually became Parallel Pascal) for a parallel
matrix processor were:

o The language should be efficiently implementable. A
  principal reason for using a parallel processor is to obtain
  the maximum possible execution speed; a language whose
  implementation is costly significantly diminishes the
  advantage of parallel processors relative to more
  conventional (and familiar) sequential processors.

o The language must permit the direct specification of
  parallelism. This relates strongly to the previous
  objective - the direct specification of parallelism produces
  more efficient programs than the extraction of inherent
  parallelism by a compiler.

o The language must be easy to learn and use. Such a language
  facilitates communication among humans and between
  programmers and computers. A language which is difficult
  will be avoided by its users whenever possible.

o The language should not require the user to have an intimate
  understanding of the hardware upon which it is implemented.

Because Pascal is (by design) efficiently implementable and
easy to learn and use, it was chosen as the basis for the new
parallel matrix language. The resulting language was therefore
named "Parallel Pascal". The following criteria were used in
the specification of Parallel Pascal:

o Parallel Pascal is an extension to standard Pascal. As
  such, it should be fully upward-compatible; that is, any
  Pascal program should also be a valid Parallel Pascal
  program.

o Parallel Pascal extensions to Pascal should be consistent
  with the design philosophy of Pascal. The design should be
  orthogonal and the new features should not detract from the
  careful program construction permitted by Pascal.

When deciding upon extensions to a language, it is necessary to
consider the applications for which the language will be used.
Someone once said that a general-purpose system (or processor, or
language) is one which does many things but which does none
of them well. To avoid the trap of implementing everything that
anyone would possibly want, new features were considered in light
of the desired applications area - image processing and dense
matrix numerical algorithms (e.g., partial differential
equations).

The following sections describe the Parallel Pascal
extensions to standard Pascal. The development of each extension
will be discussed.

3.2.2 Types

In order to meet the design objective that parallelism be
directly expressible, a suitable structure must be chosen.
Since this specification should be as compatible as possible with
standard Pascal, it is instructive to first consider the data
structuring provided by Pascal. Indeed, Pascal's flexible data
type facility is one of its most significant features.

The most basic Pascal data types are the predefined
primitive types "integer", "real", "char", and "Boolean",
and the scalar types. A user-defined scalar type associates with
the type name a set of distinct identifiers. This permits the
programmer to use mnemonic names rather than arbitrary integer
constants, which in turn improves program readability and
facilitates compile-time error checking.

The range of values which may be assigned to a scalar may be
restricted by defining a subrange type. A subrange type
definition comprises a base type (either a user-defined scalar
type or a primitive type other than "real") and a range of
legal values. Hence, if type "x" is defined by

   type x = 1..5;

then a variable of type "x" may legally take on only the values
1, 2, 3, 4, or 5. Like simple scalar types, subrange types aid
in program documentation and compile-time error checking. Also,
subrange types provide information to the compiler about the
amount of storage required for a variable of that type; in the
above example only 3 bits of storage need be allocated to specify
any legal value in type "x". Providing there is hardware
support, the compiler may choose to adjust the space allocated to
a subrange type depending upon the available hardware
representations.
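
The storage figure quoted for the subrange example follows from the number of
distinct legal values. A minimal Python sketch of the calculation is given
below; the function name bits_for_subrange is hypothetical, and the sketch
assumes that a value is stored as its offset from the lower bound, which is
one possible representation rather than the one the compiler necessarily
uses.

```python
def bits_for_subrange(low, high):
    """Minimum number of bits that can distinguish every value in low..high."""
    if high < low:
        raise ValueError("empty subrange")
    count = high - low + 1                 # number of legal values
    # The largest offset is count - 1; bit_length gives the bits needed for it.
    return max(1, (count - 1).bit_length())


bits_x = bits_for_subrange(1, 5)        # type x = 1..5
bits_pixel = bits_for_subrange(0, 255)  # pixel = 0..255
```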

A power set may be defined for a scalar or subrange type. A
power set in Pascal is conceptually the same as a set in
mathematics; it is a collection of elements, the composition of
which changes at runtime. The base type (the subrange or scalar
type over which it is defined) determines the items which may
belong to the set.

There are two data structuring facilities in Pascal, the
array and the record. An array is a homogeneous ordered set of
items. The elements of an array may be of any type: scalar,
subrange, set, array, or record type. Associated with each array
component is an "index"; the range of this index may be
specified by a scalar or subrange type. A record is a non-
homogeneous collection of items. The components of a record may
be of any type, and may occur in any order. There is also a
provision for the overlapping use of storage by allocating
elements which are mutually exclusive into the same storage area
- this is achieved through the use of a variant record.

Pascal also provides pointer types. These are defined by
the compiler and initialized at runtime by the user-controlled
dynamic storage allocation routine ("new"). They contain
addresses and may be copied and compared (for equality), thus
permitting the construction of data structures such as linked
lists, trees, etc.

The target architecture for Parallel Pascal - a parallel
matrix processor - consists of a set of identical execution units
which perform the same operation at the same time. The hardware
thus appears as an ordered collection of homogeneous processors.
This organization maps naturally into the array structure
provided by Pascal; hence, the array was chosen as the vehicle
for the expression of parallelism in Parallel Pascal.

Often a parallel matrix processor will be closely coupled to
a more conventional processor. For example, the Massively
Parallel Processor contains a main control unit which is a
conventional 16-bit minicomputer. In addition, the MPP is
attached to a host machine, a VAX-11/780. In such an
environment, it may be more efficient to perform scalar
operations on one processor and matrix operations on another.

This in turn is reflected in the assignment of storage to
variables used in the program - those variables which are used in
a scalar fashion may be physically located in a different memory
than those used in an array fashion. Parallel Pascal provides
for this situation by permitting an array to be declared
parallel:

    xxx: parallel array [1..5] of integer;

The parallel keyword is a means by which the programmer can
advise the compiler that the array (in this case, "xxx") will be
heavily used in a parallel fashion. Some compiler
implementations may choose to ignore this (e.g. if there is only
one type of memory). The concept that the user is advising the
compiler about the implementation is similar to the register
keyword in the language C [4].

It is important to note at this point that, aside from the
possible difference in physical storage, arrays declared as
parallel are syntactically and semantically equivalent to
"ordinary" arrays in Parallel Pascal.

3.2.3 Array Indexing

Having chosen the array as the vehicle for expressing
parallelism, it is necessary to specify the manner in which that
parallelism is to be expressed. The logical starting place is
the building block of any computational language - the assignment
statement. It must be possible to specify the evaluation of
array quantities in a simple, direct form.

Standard Pascal provides an array assignment statement; if
"a" and "b" are the same type then the statement

    a := b;

specifies that each element of "b" is to be assigned to the
corresponding element of "a". A natural extension of this
concept is to allow arrays of the same type to participate in
arithmetic operations; for example, given

    var
       a,b: array [1..5] of integer;
       i: integer;

the statement

    a := a + b;

would achieve the same result as

    for i := 1 to 5 do
       a[i] := a[i] + b[i];
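
The equivalence between the aggregate statement and the explicit loop can be checked with a small executable model. This is only a sketch in Python (Parallel Pascal itself cannot be run here); the names array_add and loop_add are hypothetical, and plain lists stand in for arrays of type array [1..5] of integer.

```python
# Illustrative Python model (not Parallel Pascal): lists stand in for
# arrays declared as array [1..5] of integer.
def array_add(a, b):
    """Element-wise sum, modelling the aggregate statement a := a + b."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same size")
    return [x + y for x, y in zip(a, b)]


def loop_add(a, b):
    """The explicit loop: for i := 1 to 5 do a[i] := a[i] + b[i]."""
    result = list(a)
    for i in range(len(result)):
        result[i] = result[i] + b[i]
    return result


a = [1, 2, 3, 4, 5]
b = [10, 20, 30, 40, 50]
aggregate_result = array_add(a, b)
loop_result = loop_add(a, b)
```

Both functions produce the same list, which is the property the text relies on when it treats the array statement as shorthand for the loop.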

While expressions involving identical arrays are useful,
they are limited in the range of problems to which they can be
applied. Several deficiencies are evident. First, it is
necessary to be able to select a portion of an array (for
instance, a row or a column) rather than the entire array; hence,
some additional indexing mechanisms are required. Second, it is
necessary to allow arrays of different types (but identical
shapes) to be combined in an arithmetic expression.

A number of schemes have been proposed for array indexing,
as described above in the discussion of other parallel languages.
The array indexing facilities which are provided must be powerful
enough to solve useful problems, while remaining simple enough to
implement efficiently. The choice of indexing mechanisms should
therefore begin with the simplest and proceed toward the more
complex.

In standard Pascal, each array index may be specified by a
scalar constant or expression. This is the simplest form (and
least parallel) of indexing permitted in Parallel Pascal. When
an array is indexed by a scalar its rank is (conceptually)
reduced by one. When a one-dimensional array is indexed by a
scalar the (logical) type of the result is a pointer to a scalar.

It was stated above that standard Pascal permits arrays
participating in an assignment statement to be unsubscripted.
This is actually a special case of a more general feature in
standard Pascal - it is possible to elide (omit) the rightmost
indices in an array assignment, provided the resulting
expressions are of the same type. For example, given the
definition

    var a,b: array [1..3, 1..10] of integer;

both of the following are legal assignments in standard Pascal:

    a := b;
    a[1] := b[1];

The first statement assigns to each element of "a" the value of
the corresponding element of "b". The second statement
performs this action only on the first row of "a" and "b".
It has the same effect as:

    for i := 1 to 10 do
       a[1,i] := b[1,i];

Parallel Pascal extends this to permit the omission of any index;
hence, in Parallel Pascal the statement

    a[,1] := b[,1];

assigns to the first column of "a" the values contained in the
first column of "b". The use of a scalar index effectively
reduces the rank (number of dimensions) of an array by one; hence
"a[,1]" is considered to be a vector.
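
As an illustration of elided indices, the following Python sketch models a 3 x 10 matrix as a list of rows. The helper names get_row, get_column and assign_column are hypothetical; Pascal index k is mapped to Python index k - 1.

```python
# Hypothetical helpers modelling elided indices on a 3 x 10 matrix
# stored as a list of rows (Pascal index k maps to Python index k - 1).
def get_row(m, r):
    """Models m[r] (or m[r,]): a vector of rank one."""
    return list(m[r - 1])


def get_column(m, c):
    """Models m[,c]: also a vector of rank one."""
    return [row[c - 1] for row in m]


def assign_column(dst, c, values):
    """Models dst[,c] := values."""
    if len(values) != len(dst):
        raise ValueError("column length mismatch")
    for row, v in zip(dst, values):
        row[c - 1] = v


a = [[0] * 10 for _ in range(3)]
b = [[10 * r + c for c in range(1, 11)] for r in range(1, 4)]
assign_column(a, 1, get_column(b, 1))   # a[,1] := b[,1]
```

In both cases the selected slice has one dimension fewer than the matrix, which is the rank reduction described above.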

The ability to select a row or column, as opposed to an
entire array, is useful; however, it is often desirable to
further restrict the number of elements which participate in an
operation. A common requirement is the selection of a subset of
a row, column, or both. Standard Pascal provides no symbolism to
directly express this concept; hence, it is necessary to
introduce a new construct to the language.

The simplest subset of a set of array indices is a
consecutive range. For example, given an array with five
elements, one may wish to access elements 2, 3, and 4. Standard
Pascal permits the use of a subrange in type definitions to
specify a range of values which a variable may possess. Parallel
Pascal extends this concept by defining a subrange constant. The
Pascal const statement may be used to define an identifier as a
subrange constant:

    const rangeconst = low..high;

A subrange constant may be added to a scalar expression and
used as an array index. The most desirable syntax for this would
be

    arr [scalarexpression + low..high]

(or)

    arr [scalarexpression + rangeconst]

where "rangeconst" is an identifier defined as a subrange
constant. Unfortunately, due to the recursive-descent
implementation of most Pascal compilers, this syntax introduces
complications when a compiler parses the program. In deference
to the implementation the symbol "@" is used to represent the
addition of a subrange constant:

    arr [scalarindex @ low..high]

(or)

    arr [scalarexpression @ rangeconst]


Given the following definitions:

    const
       rr = 1..5;
       cc = 2..4;
    var
       a,b: array [1..10,1..10] of integer;
       i,j: integer;

the following two code sequences achieve the same result:

    (* with subrange indexing: *)
    a[0@rr, 0@cc] := b[1@rr, 3@cc];

    (* without subrange indexing: *)
    for i := 1 to 5 do
       for j := 2 to 4 do
          a[i,j] := b[1+i, 3+j];


Subrange indexing does not alter the rank of an array. Thus,
while "a[1,]" is a 10-element vector, "a[0@1..1,]" is a 1x10
matrix.
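
The two code sequences given earlier for subrange indexing can be compared with a small Python model. This is a sketch under stated assumptions: a subrange constant is represented as a (low, high) pair, the hypothetical helper offset(k, rng) plays the role of "k @ rng", and matrices carry an unused row and column 0 so that indices run from 1 to 10 as in the text.

```python
# Hypothetical model: a subrange constant is a (low, high) pair and
# offset(k, rng) plays the role of k @ rng.
def offset(k, rng):
    low, high = rng
    return range(k + low, k + high + 1)


def make_matrix(n, fill):
    # Row and column 0 are unused padding so indices run 1..n as in the text.
    return [[fill(i, j) for j in range(n + 1)] for i in range(n + 1)]


rr = (1, 5)
cc = (2, 4)
b = make_matrix(10, lambda i, j: 100 * i + j)

# With subrange indexing: a[0@rr, 0@cc] := b[1@rr, 3@cc]
a1 = make_matrix(10, lambda i, j: 0)
for ia, ib in zip(offset(0, rr), offset(1, rr)):
    for ja, jb in zip(offset(0, cc), offset(3, cc)):
        a1[ia][ja] = b[ib][jb]

# Without subrange indexing: the explicit loops from the text.
a2 = make_matrix(10, lambda i, j: 0)
for i in range(1, 6):
    for j in range(2, 5):
        a2[i][j] = b[1 + i][3 + j]
```

Each subrange keeps its dimension (a 5 x 3 block is copied), in contrast to scalar indexing, which removes a dimension.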


Other languages provide additional array indexing
facilities, such as indexing by a logical set or a vector. These
indexing notations are powerful, but on a processor with a
limited interconnection network (e.g. a mesh network) their
implementation can be very expensive. For this reason, set and
vector indexing were excluded from the specification of Parallel
Pascal.

The ability to elide indices and use subrange constants for
array indexing brings with it an associated problem: what
combinations of array expressions are legal? In standard Pascal,
the operands of an arithmetic expression must be type compatible.
A subrange is type compatible with its base type, and integers
are type compatible with reals. (An integer may be converted to
a real number with no loss of information. Thus, if an integer
expression is used where a real expression is required, e.g. on
the right-hand side of an assignment statement or as an argument
to a function or procedure, Pascal automatically converts the
integer expression into a real expression. Since it is not true
that any real may be converted to an integer with no loss of
information, Pascal prohibits the opposite case - using a real
expression where an integer expression is required.)

Parallel Pascal preserves the Pascal concept of type
compatibility. Because of the array indexing, scalar type
compatibility alone is insufficient to determine the
conformability of array expressions. It is necessary to also
consider the rank (number of dimensions), size, and indices of
each array expression. The specification was designed to meet
the following goals (note that the term "array" below may refer
to an entire array or a subset created according to the indexing
facilities described above):

• Arrays of the same type should be compatible.

• A scalar of the same base type as an array should be
  conformable to that array. This implies that the scalar is
  effectively replicated into an array of identical type.

• Arrays which differ in rank (number of dimensions) or size
  are not compatible. Recall that indexing with a scalar
  "compresses" a dimension out of the array.

• Arrays which have the same index ranges, but whose element
  types are different, are compatible if the element types are
  compatible.

• Arrays which are the same size and shape should either be
  compatible, or it should be possible to make them compatible
  with little effort.

The first requirement above preserves the standard Pascal
array assignment statement,

    a := b;

where "a" and "b" are of identical types. The second
requirement allows the use of scalars with array expressions;
e.g. given that "a" and "b" are of the same type, and "c"
is the same type as the elements of "a" and "b", then the
following statement:

    a := b + c;

adds "c" to each element of "b" and stores the result in the
corresponding element of "a".

Arrays are required to be the same shape and size to prevent
situations such as:

    var
       a: array [1..5] of integer;
       b: array [1..6] of integer;

    b := a;

Allowing arrays which are the same type except for the
elements, which are compatible, allows common constructions such
as

    var
       a: array [1..5] of integer;
       b: array [1..5] of real;

    b := a;

The biggest difficulty which arises from the generalized
indexing mechanisms is the compatibility of arrays whose sizes
and shapes are identical, but whose index ranges are not:

    var
       a: array [1..5] of integer;
       b: array [2..6] of integer;

    a := b;

This problem becomes more severe when the control-flow facilities
of Parallel Pascal (which are discussed later in this chapter)
are used. In order to prevent ambiguity in these cases, two
arrays with non-identical index ranges are compatible only if the
elements of at least one are specified explicitly by subrange
indexing. Hence, given the "a" and "b" defined above, the
following assignments are all legal:

    a := b[0@2..6];
    a[0@1..5] := b;
    a[0@1..5] := b[0@2..6];


The type compatibility rules described above extend in the
intuitive way to multiple dimensions. If "c" and "d" are
defined by

    var
       c: array [1..3,1..5] of integer;
       d: array [1..8,1..7] of integer;

then the following assignments are all legal:

    c := d[2@2..4,0@2..6];
    c[0@1..3,] := d[0@6..8,0@2..6];
    c[1,] := a;

(Note that in the last example the ranks of "c[1,]" and "a"
are the same because the scalar indexing reduced the rank of
"c" by one.)
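
The conformability rules can be summarized as a small checking function. The following Python sketch is one possible reading of the rules, not the compiler's actual algorithm: an array expression is modelled as an element type plus one (low, high, explicit) triple per dimension, where the hypothetical flag "explicit" records that the dimension was selected by subrange indexing.

```python
# Hypothetical model of an array expression: an element type plus one
# (low, high, explicit) triple per remaining dimension. "explicit" is
# True when that dimension was selected by subrange indexing.
def conformable(dst, src):
    """Return True if 'dst := src' satisfies the rules sketched in the text."""
    dst_type, dst_dims = dst
    src_type, src_dims = src
    # Element types: identical, or integer promoted to real (never the reverse).
    if not (dst_type == src_type
            or (dst_type == "real" and src_type == "integer")):
        return False
    if not src_dims:                      # a scalar is replicated
        return True
    if len(dst_dims) != len(src_dims):    # ranks must agree
        return False
    for (dlo, dhi, dexp), (slo, shi, sexp) in zip(dst_dims, src_dims):
        if dhi - dlo != shi - slo:        # sizes must agree
            return False
        if (dlo, dhi) != (slo, shi) and not (dexp or sexp):
            return False                  # needs explicit subrange indexing
    return True


a = ("integer", [(1, 5, False)])
b = ("integer", [(2, 6, False)])
b_explicit = ("integer", [(2, 6, True)])  # models b[0@2..6]
r = ("real", [(1, 5, False)])
scalar = ("integer", [])
```

For example, conformable(a, b) is rejected while conformable(a, b_explicit) is accepted, mirroring the legal assignment a := b[0@2..6].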

3.2.4 Standard Functions

Pascal provides a number of standard functions and standard
procedures. These perform various services, including type
conversion (e.g. trunc, ord), arithmetic functions (e.g. sin,
sqrt), and input/output procedures. This last group is discussed
in more detail in section 3.2.6.

Many of the standard functions perform simple
transformations, for instance:

    var x, y: real;

    x := sqrt(y);

It is quite natural to think of these functions as extensions to
the set of operators provided by Pascal (e.g. "+" and "*").

Since Parallel Pascal allows the operators to act upon arrays as
aggregates, it is only natural to extend this feature to the
standard functions. Thus,

    var
       x, y: array [1..16] of real;
       i: integer;

    x := sqrt(y);

is effectively the same as

    for i := 1 to 16 do
       x[i] := sqrt(y[i]);

These standard functions are, in a sense, "generic"; they
may be used with arrays of any shape. The value returned by the
function has the same index ranges as its argument. Since these
functions operate independently upon each array element, they are
called elemental functions.
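
A minimal sketch of an elemental function, written in Python with nested lists standing in for arrays of any shape. The helper name elemental is hypothetical; math.sqrt is the scalar function being lifted.

```python
import math


def elemental(fn, arr):
    """Apply a scalar function to every element, preserving the shape."""
    if isinstance(arr, list):
        return [elemental(fn, item) for item in arr]
    return fn(arr)


y = [float(i * i) for i in range(1, 17)]   # stands in for array [1..16] of real
x = elemental(math.sqrt, y)                # models x := sqrt(y)

grid = [[1.0, 4.0], [9.0, 16.0]]           # a two-dimensional argument
roots = elemental(math.sqrt, grid)
```

The result always has the same nesting as the argument, which corresponds to the statement that the returned value has the same index ranges as its argument.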

While the elemental functions are useful, the effective use
of Parallel Pascal requires the use of functions which alter the
structure of arrays in a more complex fashion. These functions
are referred to as transformational functions.

The first type of array restructuring which Parallel Pascal
provides is the reordering of array elements. Certain image
processing algorithms (e.g. convolution) require the capability
to move data within an array. This movement may take two forms:
a "shift", in which data is shifted off the edge of the array
and zeros are brought in from the other end, or a "rotate", in
which data is moved within an array such that data shifted off
one end will reappear at the other. Parallel Pascal provides
both these functions:

    shift(array, i_1, i_2, i_3, ...)
    rotate(array, i_1, i_2, i_3, ...)

where array is the array name and i_n is the magnitude of the
shift along dimension n. (Row major order is used.)
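
The difference between the two forms of movement can be sketched for a one-dimensional array. This Python model is illustrative only; in particular, the direction associated with a positive magnitude is an assumption made here, since the text does not state it.

```python
def rotate(vec, k):
    """Circular move: result[i] = vec[(i + k) mod n] (direction assumed)."""
    n = len(vec)
    return [vec[(i + k) % n] for i in range(n)]


def shift(vec, k):
    """Like rotate, but positions whose source lies outside the array get zero."""
    n = len(vec)
    return [vec[i + k] if 0 <= i + k < n else 0 for i in range(n)]


v = [1, 2, 3, 4, 5]
rotated = rotate(v, 1)
shifted = shift(v, 1)
```

With rotate the element that leaves one end reappears at the other; with shift it is lost and a zero is brought in.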

In addition to shifting (rotating) data, it is sometimes
necessary to transpose two dimensions of an array. This is
performed by the trans function:

    trans(array, dim_1, dim_2)

This effectively swaps two index ranges. For instance, given the
definitions

    var
       x: array [0..7, 3..4, 6..10] of integer;
       y: array [6..10, 3..4, 0..7] of integer;
       i,j,k: integer;

then the statement

    y := trans(x, 1, 3);

is equivalent to

    for i := 0 to 7 do
       for j := 3 to 4 do
          for k := 6 to 10 do
             y[k,j,i] := x[i,j,k];
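
The index swap performed by trans(x, 1, 3) can be modelled with nested lists. In this Python sketch the function name trans_1_3 is hypothetical, and only the extents of the three dimensions (8, 2 and 5) are represented; the lower bounds 0, 3 and 6 are not.

```python
def trans_1_3(x):
    """Swap dimensions 1 and 3 of a 3-D nested list: y[k][j][i] = x[i][j][k]."""
    n1, n2, n3 = len(x), len(x[0]), len(x[0][0])
    return [[[x[i][j][k] for i in range(n1)]
             for j in range(n2)]
            for k in range(n3)]


# x models array [0..7, 3..4, 6..10]; only the extents (8, 2, 5) matter here.
x = [[[100 * i + 10 * j + k for k in range(5)]
      for j in range(2)]
     for i in range(8)]
y = trans_1_3(x)
```

Applying the same swap twice returns the original array, as expected for an exchange of two index ranges.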
The second major type of array manipulation is the
alteration of the number of dimensions of an array. Some array
operations may require that an array with N dimensions be combined
with an array with N+1 dimensions. (One example is the
computation of the matrix product Ax where A is an nxm matrix and
x is an m-element vector. In this case, it would be desirable to
multiply all rows of A by x simultaneously; equivalently, to
perform an element-by-element multiplication of A and an nxm
matrix each of whose rows is the vector x.) The expand
function can be used to expand an array along a new dimension:

    expand(array, dim, newidx)

array is either a scalar or an array. Let N be the number of
dimensions of array (zero if array is a scalar). dim must be an
integer constant in the range 1 to N+1. newidx is a subrange or
the name of a subrange type (note: it is not a subrange
constant). The array is replicated along a new dimension of type
newidx which is inserted before dimension dim. For example,
given the definitions

    var
       x: array [0..7,8..15] of integer;

the result of

    expand(x,2,5..7)

is a matrix with dimensions [0..7,5..7,8..15] in which the value
of [i,j,k] is the same for all 5 <= j <= 7.
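
The matrix-vector product mentioned as motivation for expand can be sketched as follows. This Python model is an assumption-laden illustration: expand_rows is a hypothetical stand-in for expanding a vector along a new first dimension, and the final row sum (a reduction, which the product needs but which expand itself does not provide) is added here to complete the computation.

```python
def expand_rows(x, n):
    """Models expanding a vector along a new first dimension: n copies of x."""
    return [list(x) for _ in range(n)]


def elementwise_mul(p, q):
    return [[u * v for u, v in zip(rp, rq)] for rp, rq in zip(p, q)]


def sum_rows(m):
    """Reduce along the second dimension: one total per row."""
    return [sum(row) for row in m]


A = [[1, 2, 3],
     [4, 5, 6]]          # n = 2 rows, m = 3 columns
x = [1, 0, 2]            # m-element vector
expanded = expand_rows(x, len(A))
Ax = sum_rows(elementwise_mul(A, expanded))
```

Every row of the expanded matrix is the vector x, so one element-by-element multiplication handles all rows of A at once.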


The last type of array operation which is frequently
required is the reduction of an array - applying an operator over
a set of dimensions. For instance, one might wish to accumulate
the sum along all of the rows of an array. Reduction functions
have the general form:

    func(array, dim_1, dim_2, dim_3, ...)

where array is the array to be reduced and each dim_i is a
constant expression specifying a dimension along which the
reduction is to be performed. (These are required to be
constants so that the compiler may determine the shape of the
result.) The following reduction functions are provided:

    sum     arithmetic sum
    prod    arithmetic product
    all     Boolean AND
    any     Boolean OR
    min     arithmetic minimum
    max     arithmetic maximum
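
The six reduction functions can be modelled for a two-dimensional array as shown below. This Python sketch reflects one interpretation of "a dimension along which the reduction is performed": reducing along dimension 1 collapses the row index and reducing along dimension 2 collapses the column index. The names REDUCERS, reduce_dim and reduce_all are hypothetical.

```python
from functools import reduce
import operator

REDUCERS = {
    "sum": operator.add,
    "prod": operator.mul,
    "all": lambda p, q: p and q,
    "any": lambda p, q: p or q,
    "min": min,
    "max": max,
}


def reduce_dim(name, matrix, dim):
    """Reduce a 2-D nested list along dimension 1 or dimension 2."""
    op = REDUCERS[name]
    if dim == 2:
        return [reduce(op, row) for row in matrix]
    if dim == 1:
        return [reduce(op, col) for col in zip(*matrix)]
    raise ValueError("dim must be 1 or 2")


def reduce_all(name, matrix):
    """Models func(array, 1, 2): reduce over both dimensions to a scalar."""
    return reduce(REDUCERS[name], reduce_dim(name, matrix, 2))


m = [[1, 2, 3],
     [4, 5, 6]]
```

Because the dimensions are constants, the shape of each result (a vector of length 2, a vector of length 3, or a scalar) is known before the data is.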


3.2.5 Control Flow

Standard Pascal provides several mechanisms for controlling
the flow of execution. The most basic (and often overlooked)
mechanism is sequencing - assignment statements are executed one
at a time, in the order they appear. At a higher level, the flow
of control may be altered by one of the following mechanisms:


procedures     A Pascal program consists of a set of procedures
               and functions (called "subroutines" here for
               convenience). A subroutine call (a procedure call
               or function call) diverts the flow of control to
               one of these subroutines. When execution of the
               subroutine is complete, control returns to the
               statement following the subroutine call.

repetition     A statement (or a group of statements) may be
               executed several consecutive times by using the
               Pascal while, repeat-until, or for constructs.
               For the while construct a Boolean expression is
               evaluated before each iteration of the controlled
               statement(s); as long as this controlling
               expression is true the repetition continues. The
               repeat-until construct performs the test after
               each iteration rather than before it; when the
               termination condition is satisfied (the Boolean
               expression is true) the iteration stops. The for
               statement uses an index variable; this variable is
               assigned an initial value and successively
               incremented or decremented until it reaches a
               final value. The controlled statement (or
               statements) is executed for each value of the
               index.

conditional    A statement (or group of statements) may be
               conditionally executed by placing it in the body
               of an if statement. If the controlling expression
               evaluates to true the statements are executed;
               otherwise, they are skipped. Optionally, an else
               keyword may be specified, followed by a second
               statement (or block of statements); this statement
               is executed if the controlling expression is
               false.

selection      One of a set of statements (where each
               "statement" may actually be a block of
               statements) may be executed according to the value
               of a controlling expression. This is the case
               statement in Pascal.

"goto"         The flow of control may be directed to any defined
               statement label by use of the goto statement in
               Pascal. The ability to transfer control to any
               label within the program has been criticized as an
               impediment to good program design [5]. It was
               included in Pascal because of the lack of a
               general agreement as to what should replace it [6]
               and because it is occasionally useful for breaking
               out of deeply-nested code structures.

All of the standard Pascal control flow constructs are
present in Parallel Pascal. In order to effectively deal with
arrays as aggregate entities, it is necessary to extend these
constructs to deal with array operations. This extension must be
carefully considered to avoid adding unnecessary complexity to
the semantics of the language.

The most basic form of program construction - sequencing -
is essentially the same for an SIMD-class processor (such as a
parallel matrix processor) as it is for an SISD-class
(conventional scalar) processor. (This concept changes in an
MIMD-class processor, since in that environment many instruction
streams may be simultaneously processed.) Similarly, the concept
of a procedure or function call, and the meaning of a goto are
unchanged. This suggests that the extensions to Pascal will be
based upon its control statements: if, case, while, repeat-until,
and for.

The if statement causes the execution of one (of possibly
two) statement(s) according to the value of a controlling
expression. The execution is "all-or-nothing" - either the
controlled statement is executed or it is not. This is well
suited to a scalar machine, but it presents problems in Parallel
Pascal. It is sometimes necessary to conditionally perform some
actions using only a subset of an array. Parallel Pascal
provides the where statement to address this need.

The where statement has two forms:

    where arrayexpression do
       statement

    where arrayexpression do
       statement
    otherwise
       statement

where "arrayexpression" is a Boolean-array-valued expression
and "statement" is a Parallel Pascal statement. Some
restrictions apply to the controlled "statement":

• A goto out of the where or between the two controlled
  statements in a where is forbidden. (These restrictions are
  imposed to facilitate the implementation of where statements
  with a conditional stack; uncontrolled use of the goto
  complicates such an implementation.)

• Array variables which appear on the left-hand side of an
  assignment statement must be type-compatible with the
  controlling array expression.

The execution of a where is defined as follows. First, the
controlling expression is evaluated to obtain a Boolean array.
Next, the first controlled statement (referred to later as the
where clause) is evaluated. Array assignments are masked
according to the Boolean array computed above. Finally, if there
is a second controlled statement (an otherwise clause), it is
evaluated. Array assignments within the "otherwise clause" are
masked by the inverse of the Boolean array computed in the first
step.


where statements may be nested, provided that all of the
controlling array expressions are type compatible. The effect of
a where statement is local to the procedure or function in which
it appears; that is, it does not affect the execution of any
procedures or functions called from within a "where clause" or
"otherwise clause".

The where statement provides Parallel Pascal with
conditional assignment (or masked assignment). That is, all
array expressions within both the "where clause" and
"otherwise clause" are fully evaluated, but the results are
only assigned to a subset of the array appearing on the left-hand
side of an assignment statement. This allows the specification
of many common problems, for instance: "Given two arrays A and B
(of the same type), determine the maximum of A and B element-by-
element and store the result in A." This is achieved by the
statement:

    where a < b do
       a := b;
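
Masked assignment can be modelled in a few lines. The following Python sketch is a simplification made for illustration: the hypothetical function where_assign handles a single target array, whereas the where and otherwise clauses of the language are general statements.

```python
def where_assign(mask, target, where_values, otherwise_values=None):
    """Masked assignment: right-hand sides are fully evaluated first,
    then only the selected elements of the target are overwritten."""
    result = list(target)
    for i, flag in enumerate(mask):
        if flag:
            result[i] = where_values[i]
        elif otherwise_values is not None:
            result[i] = otherwise_values[i]
    return result


a = [3, 9, 4, 7]
b = [5, 2, 4, 8]
mask = [x < y for x, y in zip(a, b)]      # controlling expression a < b
a_max = where_assign(mask, a, b)          # where a < b do a := b
```

Elements for which the controlling expression is false keep their old value, so the result is the element-by-element maximum of the two arrays.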

An alternative to conditional assignment is conditional
evaluation. A conditional evaluation scheme would cause the
evaluation of all array expressions to be masked (element by
element) by the controlling expression. This could be used to
catch exceptional conditions; for instance, divide by zero:

    where a <> 0 do
       a := 1/a;

While conditional evaluation provides some additional
capabilities that conditional assignment does not, it introduces
semantic difficulties. One problem which conditional evaluation
raises is the treatment of function (or procedure) calls from
within the where statement. If an array expression is passed to
a function, what values are passed for those elements for which
the controlling expression is false? Similar problems arise with
the use of standard functions which alter the shape of arrays -
at what point is the masking applied (for at that point the
expression must be type compatible with the controlling
expression)? The presence of these problems with conditional
evaluation and the relative semantic simplicity of conditional
assignment led to the latter's choice for the where construct.

The design of the where statement as a parallel extension of
the if statement led to the consideration of a parallel extension
of the Pascal case statement. The case statement selects one
statement (or block of statements) from several depending upon
the value of a controlling expression. It differs from the if
statement in that the controlling expression is multi-valued
rather than Boolean; hence, a very large number of alternatives
may be selected. It was felt that a parallel version of the case
statement would be used infrequently; in the interest of keeping
the size of the language to a minimum it was therefore omitted
from Parallel Pascal. If necessary, the effect of a parallel
case statement can be achieved through the use of a series of
where statements (in the same fashion as a standard Pascal case
statement can be implemented by a series of if statements).

The only remaining control constructs to be considered are
the loop structures while, repeat-until, and for. The loop is
one of the biggest sources of error for programmers; therefore,
adding complexity to the looping mechanisms seemed unwise. It is
unclear how a for loop should be extended. Further, a
combination of a standard Pascal while or repeat-until loop
statement (perhaps using a reduction function such as "any" or
"all" to use conditionals based upon entire arrays) and a where
statement can express all of the operations that any new loop
construct of moderate complexity could express.
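
The combination of a scalar loop, a reduction and a where statement can be sketched as follows. The example problem (halving every element that is at or above a limit until none remains) and the name halve_until_below are invented here for illustration; they do not come from the report.

```python
def halve_until_below(values, limit):
    """Models: while any(a >= limit) do where a >= limit do a := a div 2."""
    if limit < 1:
        raise ValueError("limit must be at least 1 so the loop terminates")
    a = list(values)
    passes = 0
    while any(x >= limit for x in a):          # reduction over the array
        mask = [x >= limit for x in a]         # controlling expression
        a = [x // 2 if m else x for x, m in zip(a, mask)]
        passes += 1
    return a, passes


result, passes = halve_until_below([100, 7, 33, 8], 10)
```

The loop condition is an ordinary scalar Boolean obtained from the reduction, so the standard while statement needs no extension.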
3.2.6 Input and Output

Pascal provides a fairly minimal set of input and output
procedures. Each file consists of a uniform sequence of objects
of a fixed type. The file is accessed by means of a "buffer
variable". Syntactically, file buffer variables are used in the
same fashion as pointers. To perform output, the data is placed
into the file buffer and the "put" procedure is called. To
perform input, the file buffer variable is read, after which the
"get" procedure is called to advance to the next item in the
file. The "eof" function may be used to determine whether a
file is positioned at the end. The "reset" procedure
repositions a file at the beginning and makes it available for
reading, while the "rewrite" procedure repositions a file at
the beginning after truncating it, and makes it available for
writing.

In addition to the "get" and "put" procedures, the
"read" and "write" procedures may be used. Given the file
"f" and variable "x" (both having the same type) the
following equivalences hold:

    read(f,x)   ≡   x := f^; get(f)
    write(f,x)  ≡   f^ := x; put(f)
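
The two equivalences can be exercised with a small model of a file and its buffer variable. This Python class is a hypothetical sketch, not a description of any Pascal run-time system; the attribute buffer plays the role of f^.

```python
class SequentialFile:
    """Hypothetical model of a Pascal file with a buffer variable (f^)."""

    def __init__(self, items=()):
        self.items = list(items)
        self.pos = 0
        self.buffer = self.items[0] if self.items else None

    def eof(self):
        return self.pos >= len(self.items)

    def get(self):
        """Advance to the next item and load it into the buffer."""
        self.pos += 1
        self.buffer = None if self.eof() else self.items[self.pos]

    def put(self):
        """Append the buffer contents to the file."""
        self.items.append(self.buffer)
        self.pos = len(self.items)


def read(f):
    x = f.buffer      # x := f^
    f.get()           # get(f)
    return x


def write(f, x):
    f.buffer = x      # f^ := x
    f.put()           # put(f)


src = SequentialFile([10, 20, 30])
dst = SequentialFile()
while not src.eof():
    write(dst, read(src))
```

Copying a file item by item therefore needs only read, write and eof, with get and put hidden inside them.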

Files whose elements are of type "char" (i.e. those of
type "text") are treated specially. The procedures "read"
and "write" may be used to transfer numeric data to or from a
text file - the appropriate conversion is performed. In
addition, the procedures "readln" and "writeln", and the
function "eoln" are provided for intelligent handling of line-
formatted input.

The use of text files for mass information input and output
on a parallel processor was considered highly unlikely.
Therefore, it was decided that Parallel Pascal needed no
additional provisions for dealing with text files beyond those
provided by standard Pascal.

On the other hand, it was apparent that "binary format"
input and output would be heavily used. In particular, the
limited main memory of a matrix processor implies that a great
deal of data movement will be performed during the execution of a
program. This subject falls in a "gray area" between the
specification of the language and its implementation, for the
manner in which the main memory of the parallel processor is
managed directly affects the type of input and output required.
For these reasons, it was decided to retain standard Pascal input
and output without extensions for the definition of Parallel
Pascal. The facilities which are required for memory management
can best be determined after a period of use. Additional
standard functions (which can be added to the language without
significant trauma) could be added at a later time if a definite
need arose. (Another, less desirable possibility would be the
inclusion of some standard procedures on a site-dependent basis.
This is in fact likely in other areas of Parallel Pascal, e.g.
the implementation of interconnection functions which are more
complicated than the simple mesh network defined for Parallel
Pascal, but which are supported by a particular machine.)

A discussion of some of the possible input and output
facilities for managing a limited memory is presented in Section
5.
4. PARALLEL P-CODE


In this section the development of Parallel P-code is
described. Appendix B contains a complete description of
Parallel P-code.


4.1 Pseudo-code

The concept of pseudo-code ("P-code") was introduced by
Urs Ammann with the portable Pascal P4 compiler [1]. P-code is a
simple, fixed format language representing the assembly language
of a hypothetical stack computer ("P-machine"). The low-level,
implementation-dependent details (e.g. the internal
representation of the various data types) are not specified. The
operators in P-code were chosen to closely reflect the
architecture of contemporary computer systems; hence, code
generators (to convert P-code to native machine code) can be
constructed fairly easily. Alternately, the P-code can be
converted from its symbolic form to a binary form and executed by
an interpreter.

The structure of P-code reflects the structure of Pascal.
The P-machine upon which P-code runs is a stack-oriented
computer. All procedure activation records are maintained on the
stack, permitting access to variables according to their lexical
level (static nesting) and offset within the procedure
activation. There are no directly-accessible registers.
Instructions and data are completely separate, and all data
memory locations (whether on the stack or the heap) are strongly
typed. The P-code instruction set includes instructions to check
array bounds.

Since P-code was introduced, some variants have been
developed. Reference 2 compares standard P-code (also called P-4
P-code) with variants developed at the University of California
at San Diego and at Los Alamos Scientific Laboratories. These
variants were motivated by implementation needs. In the case of
UCSD, an efficient and compact form was needed for execution on
micro- and mini-computers with a limited address space. In the
case of LASL, extensions were needed to fully utilize the target
machine (a CRAY-1) and to interface with Fortran programs.

An alternate intermediate language form which was also
considered was a tree-structured language. One example of this
form is "T-code", a language defined by the Systems Research
Group of the University of Illinois [3] for use in another
compiler. T-code is a directed acyclic graph based upon Pascal
expression trees. The representation of Pascal expressions as
trees, rather than as a linear sequence of P-code instructions,
can facilitate optimization and code generation.


Since the P4 compiler was selected as the basis for
implementing a Parallel Pascal compiler, it was decided to define
an intermediate language based upon P4 P-code. While a compiler
which directly produced an intermediate language such as T-code
was considered, the simplicity of a P-code-like language, the
P-code nature of the P4 compiler, and the fact that the Illinois
T-code generator accepted P-code as input swayed the decision in
favor of the more conventional pseudo-code format.

Although a P-code format was chosen as the intermediate
language, standard P-code was inadequate to represent the data
structuring and aggregate operations that are required for a
parallel matrix processor. This led to the development of a new
intermediate language, based upon conventional P-code, called
"Parallel P-code". The following sections describe the
development of Parallel P-code.


4.2 Data Types

The most significant difference between standard P-code and
Parallel P-code is the way in which they treat data types. In
standard P-code, only a few data types are supported - integer,
real, Boolean, character, set, and pointer. These are sufficient
to perform all operations in standard Pascal, because Pascal
deals with data on an element-by-element basis. Parallel Pascal,
however, permits (and in fact encourages) the manipulation of
arrays as aggregates. The problem of array aggregate operations
can be solved in one of two ways.

The first alternative is to define a small number of new
data types representing the array types that the matrix processor
can act upon directly. A set of fundamental operations would be
defined upon these basic data types. Larger operations would
then be explicitly "unrolled" by the compiler into a sequence of
these fundamental operations. This approach positions the
intermediate language at a low level; the pseudo-code is nearly
the assembly language of the target machine with the syntax
"cleaned up".

The second alternative is to treat all operations at a high
level. Rather than having a finite set of fundamental types, the
compiler would define new data types and would specify array
operations with a single instruction instead of an unrolled
sequence. This approach positions the intermediate language at a
much higher level than the first approach.

Of the two schemes, the first method requires more work by
the compiler "front end" and very little work by the "back
end" (code generator). The second method requires a great deal
more of the code generator. However, the intermediate language
for the second method is much more machine-independent, and with
its higher information content it facilitates optimization.
Parallel P-code is designed according to the second approach.

The base types defined in Parallel P-code are very similar
to those in standard P-code: integer, real, character, and
Boolean. With the appropriate definition statements, these base
types are used to define all structured types.

In the following discussion, the definition statements are
referred to as "pseudo-operators" or "pseudo-ops" since their
role in Parallel P-code is very similar to the role of
pseudo-operators in a conventional assembly language.

4.2.1 Subrange Types

Standard P-code uses objects of type "integer" to hold
values of a subrange type. While this is suitable for a
conventional word-oriented machine, a bit-addressable machine
(such as a bit-serial matrix processor) can utilize memory more
efficiently by only allocating the minimum number of bits needed
to represent all values within the subrange. Parallel P-code
provides the .RANGE pseudo-op to declare a subrange; for example,
the type "rng" can be defined to be the integers from 1 to 5
with the statement:

     .RANGE rng,1,5
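
The saving available on a bit-addressable machine can be
estimated with a short calculation. The following is only a
sketch: the function name is hypothetical, and it assumes the
values are stored as offsets from the low bound, which the
report does not specify.

```python
def bits_for_subrange(low: int, high: int) -> int:
    """Minimum number of bits that can distinguish every value in low..high.

    Assumption (not from the report): a value v is stored as v - low.
    """
    if low > high:
        raise ValueError("empty subrange")
    count = high - low + 1                 # number of distinct values
    # (count - 1) is the largest stored offset; its bit length is the answer.
    return max(1, (count - 1).bit_length())


# The subrange declared by ".RANGE rng,1,5" holds five distinct values.
rng_bits = bits_for_subrange(1, 5)
print(rng_bits)
```

Under that assumption the five values of "rng" fit in three
bits, compared with a full word on a word-oriented machine.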

The base type for a subrange is always "integer". As in
standard P-code, integers are used to represent user-defined
scalar types. There is no provision for a subrange of characters
- the standard character type is used instead.

4.2.2 Set Types

In standard P-code there is only one type for sets. The P4
compiler implementation notes [1] recommend the use of a bitstring
to implement a set. Limiting the set to one representation
restricts its generality in two ways. First, the maximum number
of elements in the set is fixed. Second, the range of the
elements themselves is restricted. That is, if there are numset
possible elements, then they are represented by the integers 0,
1, ... numset-1. Integers which fall outside of this range
cannot belong to a set.

Parallel P-code permits the definition of a powerset type
with the .SET pseudo-op. For example, the type "pset" can be
defined to contain the integers from 5 to 10 with the statement:

     .SET pset,5,10

The base type for a powerset is always "integer". As in
standard P-code, integers are used to represent user-defined
scalar types.

Parallel P-code does not define the format that a powerset
is to have; instead, it is left to the implementation. However,
it is occasionally necessary to specify a powerset constant. The
constant is specified by the type of the set and the elements, as
for the powerset constant "[5, 6, 9]" of type "pset". It is
necessary to specify the type name as well as the members, since
different sets may have different implementations.
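
Since the powerset format is left to the implementation, the
following is only one possible representation, sketched under
the assumption that a bitstring is used with bit k standing for
the element low+k. The class name and its methods are
hypothetical.

```python
class RangedSet:
    """Bitstring set over low..high; bit k stands for element low + k."""

    def __init__(self, low, high, members=()):
        self.low, self.high = low, high
        self.bits = 0
        for member in members:
            self.add(member)

    def add(self, value):
        # Elements outside the declared range cannot belong to the set.
        if not self.low <= value <= self.high:
            raise ValueError(f"{value} outside {self.low}..{self.high}")
        self.bits |= 1 << (value - self.low)

    def __contains__(self, value):
        if not self.low <= value <= self.high:
            return False
        return bool((self.bits >> (value - self.low)) & 1)


# A constant of the type declared by ".SET pset,5,10".
pset_const = RangedSet(5, 10, [5, 6, 9])
print(bin(pset_const.bits))
```

Because the bit positions are relative to the low bound, a set
over 5..10 needs six bits regardless of how large the element
values themselves are.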

4.2.3 Files

The P4 compiler only permits files to be of type "text",
that is, "file of char". Thus, there is no need to distinguish
files of different types in standard P-code. Parallel P-code
provides the .FILE pseudo-operator for specifying the type of a
file. The syntax is intuitive: to define "ftype" as a file
with elements of type "etype" the statement is:

     .FILE ftype,etype

4.2.4 Array Types

In standard P-code, almost all operations are performed on
scalar elements. (The exception to this rule is a provision for
moving blocks of data from one place to another.) Parallel
Pascal, however, requires operations to be performed upon arrays
as aggregates. As discussed above, the decision was made to
provide a formalism for specifying these parallel operations in
the intermediate language.

In order to process array operations, the code generator 
must know at least the size of the array and the type of 
elements. For more sophisticated operations (e.g. operations
involving only a subset of the array) it must also know the
layout of the array: the number and range of array dimensions.
This information can be divided into two portions, static and
dynamic.

The static portion represents information that is known at
compile time. It consists of such things as the base type (i.e.
the type of the array elements), the number of dimensions, and
the low and high bounds of each dimension. This portion can be
considered the logical specification of the data.


The dynamic portion of an array type consists of the address
of the array and the specification of which elements are to
participate in an operation. This portion therefore represents
the physical specification of the data - where it is stored and
what portions of it (e.g. which array elements) are to be
affected.


The static and dynamic information is collectively referred
to as an array descriptor. The parallel languages ALA [4] and
LRLTRAN [5] also contain array descriptors, but there are several
significant differences between those descriptors and Parallel
P-code descriptors. The descriptors in ALA and LRLTRAN are
user-accessible, while the descriptors defined here are not
directly user-accessible. Parallel Pascal contains no concept of
an array descriptor; they are defined only in the Parallel P-code
implementation. The size of the data referenced by an ALA or
LRLTRAN descriptor may be varied by the user; Parallel Pascal
descriptors by contrast always refer to data whose size is fixed.
(Both types of descriptors allow selection of a subset of the
elements to which they refer.)

The static portion of an array descriptor is specified in
Parallel P-code via the .ARRAY pseudo-operator. The base type
(i.e. array element type), number of dimensions, and range of all
dimensions are specified. For instance, the array type defined
by

     arr = array [1..5,2..6] of integer;

would be defined in Parallel P-code with the statement:

     .ARRAY arr,integer,2,1,5,2,6

An array la never defined In terms of another array; thus, 
the following definitions: 

     row = array [1..5] of real;
     mat = array [4..8] of row;

will be translated to Parallel P-code as:

     .ARRAY row,real,1,1,5
     .ARRAY mat,real,2,4,8,1,5

Parallel Pascal provides the parallel reserved word for 
declaring that an array should be allocated In the parallel array 
memory rather than the sequential control unit memory. If an 
array is declared parallel, this fact is reflected in Parallel
P-code by a negative rank. For instance,

     arr = parallel array [2..4,8..16] of integer;

is translated to

     .ARRAY arr,integer,-2,2,4,8,16

The dynamic portion of array descriptors will be dealt with
in more detail in a later section.
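
The .ARRAY statements shown in this section can be produced
mechanically from a type definition. The sketch below is
illustrative only; the function names and the tuple
representation of a type are assumptions, not part of the
compiler described here.

```python
def array_statement(name, base, dims, parallel=False):
    """Build an .ARRAY line: name, base type, rank, then low,high per dimension.

    A parallel array is marked by a negative rank.
    """
    rank = -len(dims) if parallel else len(dims)
    bounds = [str(bound) for low, high in dims for bound in (low, high)]
    return ".ARRAY " + ",".join([name, base, str(rank)] + bounds)


def flatten(outer_dims, element):
    """Merge an array-of-array definition into one base type and dimension list."""
    inner_base, inner_dims = element
    return inner_base, list(outer_dims) + list(inner_dims)


row = ("real", [(1, 5)])                   # row = array [1..5] of real
mat_base, mat_dims = flatten([(4, 8)], row)  # mat = array [4..8] of row

row_line = array_statement("row", *row)
mat_line = array_statement("mat", mat_base, mat_dims)
arr_line = array_statement("arr", "integer", [(2, 4), (8, 16)], parallel=True)
print(row_line, mat_line, arr_line, sep="\n")
```

Flattening the element type before emitting the statement is
what guarantees that no array is defined in terms of another
array.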

4.2.5 Record Types

In order for arrays of records to be intelligently
processed, it is necessary for the intermediate language to
define descriptors for records, as well as arrays. Like array
descriptors, record descriptors consist of a static and a dynamic
portion. The static portion specifies the record: the fields and
their types. The dynamic portion specifies the address of the
record and the field which has been selected for a particular
operation. (Unlike arrays, it is not possible that more than one
field in a particular record will be simultaneously selected.
This property is a result of the choice of the array, rather than
the record, as the data structure used to express parallelism, as
described in chapter 2.)

Because the structure of a record is not as regular as the
structure of an array, a single type definition statement for the
static portion of a record would be cumbersome. For that reason,
Parallel P-code defines records according to the fields which
they contain. The pseudo-operator used to define record
components is .RECORD. One .RECORD is generated for each field.


Parallel Pascal, like standard Pascal, permits variant
records. When a record has variants, several components will
share the same memory allocation. (Only one is in use at any
given time.) Parallel P-code permits the specification of an
offset with each field declaration. A record definition in
Parallel P-code consists of a sequence of .RECORD statements.
Normally, each successive field in the same record is assigned a
sequential location in memory. However, this behavior can be
overridden so that a field is aligned at the same offset as a
previous field.

The general syntax of the .RECORD pseudo-op is

     .RECORD rname,fname,offset,ftype

where "rname" is the name of the record being defined,
"fname" is the name of the field being defined, "ftype" is
the type of the field, and "offset" is either "nil" or the
name of a previously-defined offset. If "offset" is the
literal string "nil", the next sequential memory location is
assigned; otherwise, the new field "fname" is aligned with the
existing field "offset". As an example, the record defined by:

     rec = record
             x: integer;
             y: real;
             case Boolean of
               false: (zf: integer);
               true:  (zt: real)
           end;

would be translated to

     .RECORD rec,nil,x,integer
     .RECORD rec,nil,y,real
     .RECORD rec,nil,zf,integer
     .RECORD rec,zf,zt,real
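
The effect of the alignment operand can be illustrated by
computing field offsets from such a sequence of statements. This
is a sketch under two assumptions that are not taken from the
report: the operand order follows the example statements above
(record, alignment, field, type), and every field occupies one
storage unit. The function name is hypothetical.

```python
def layout(statements, size_of=lambda ftype: 1):
    """Assign an offset to each field; aligned fields reuse an earlier offset."""
    offsets = {}
    next_free = 0
    for line in statements:
        _, operands = line.split(None, 1)          # drop the ".RECORD" keyword
        rname, align, fname, ftype = [op.strip() for op in operands.split(",")]
        if align == "nil":
            offsets[fname] = next_free             # next sequential location
        else:
            offsets[fname] = offsets[align]        # overlay on an existing field
        next_free = max(next_free, offsets[fname] + size_of(ftype))
    return offsets


rec_statements = [
    ".RECORD rec,nil,x,integer",
    ".RECORD rec,nil,y,real",
    ".RECORD rec,nil,zf,integer",
    ".RECORD rec,zf,zt,real",
]
rec_offsets = layout(rec_statements)
print(rec_offsets)
```

The two variant fields "zf" and "zt" receive the same offset,
which is how the variants come to share one memory allocation. A
real code generator would supply machine-specific sizes through
something like the size_of parameter.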





4.2.6 The Dynamic Portion of Descriptors

In order to examine the specification of the dynamic portion
of array and record descriptors, it is necessary to first
consider the way in which they are to be used.

As discussed above, standard P-code is the assembly language
of a hypothetical stack computer. Parallel P-code was also
designed with this general philosophy. All operations are
performed by means of a run-time stack. Data is loaded onto the
top of the stack, manipulated on the stack, and stored from the
top of the stack. In standard P-code, data is manipulated in one
of two ways. The first way is to load the data onto the stack
and manipulate it directly. This is the most common method (in
standard P-code) and it works well because Pascal usually deals
only with one item at a time. An alternate way is to perform a
data transfer of a compile-time specified number of elements
between two addresses which are computed at runtime. In this
second case (used in assignment statements where both sides are
identical arrays or records), the addresses, not the data, reside
on the stack. They could be called very simple descriptors
because they describe where the referenced data is (or is to go).

It seems reasonable that Parallel P-code should also make
use of these two mechanisms. When an operation is performed on
scalar data, the data itself is loaded onto the runtime stack,
manipulated, and stored from the stack. When an operation
involves an array or record, or some combination thereof, the
second method is called for. However, because Parallel Pascal
provides more flexibility in aggregate operations, an address
alone is not sufficient. Unlike the standard P-code case, which
involved a typeless move of a consecutive block of data from one
address to another, information must be provided about the shape
and type of the data. The type information is supplied by the
static descriptor (i.e. by an .ARRAY or .RECORD pseudo-operator).
The runtime-dependent shape information is provided by the
dynamic descriptors on the runtime stack.

Dynamic descriptors on the runtime stack in Parallel P-code
are most easily understood when considered recursively. Each
level of structuring is applied to a descriptor formed at a
higher level. Before exploring this concept completely, an
examination of the format for array and record descriptors is in
order.


The runtime nature of an array is determined by two dynamic
attributes: the address of the array and the index ranges of its
dimensions. The dynamic (physical) portion of the array
descriptor which resides upon the runtime stack specifies these
attributes. This information is constructed by loading a
"blank" descriptor (one which specifies the array address but
does not specify index ranges) and then "filling in" the index
ranges using one of three operators: IX0 (select entire index
range), IX1 (index by a scalar), or IX2 (index by a subrange).
Each successive index instruction is applied to the next
unspecified array index range. Note that the compiler does not
know or care in what format the dynamic array descriptor is
specified.

The concept that indexing by a scalar is to reduce the rank
of the array (e.g. a column of a matrix is considered to be a
vector) requires extra attention. The static type of the
top-of-stack is changed by scalar indexing. This represents the
logical type of the data. Parallel P-code does not specify the
impact upon the dynamic portion of the descriptor, which
indicates the physical attributes of the object. In the
hypothetical machine which implements Parallel P-code, the
dynamic descriptor still specifies the physical memory associated
with the array, even though the type of the array has changed. A
code generator (which does not actually simulate a runtime stack)
must similarly "remember" the physical origins of an array
whose logical shape has been altered by scalar indexing.

In contrast with arrays, only one component of a record may
be specified at a time. However, unlike arrays, the fields in a
record are non-homogeneous. The manner in which the target
machine stores the fields of the records will affect how a record
field is specified; the compiler cannot simply calculate a
constant offset (as is done in standard P-code). Word sizes
differ between machines - one machine may store both integers and
floating-point numbers in the same size word, while another may
require several units of storage for a floating-point number. A
further complication is introduced by the architecture of the
intended target machine (a parallel matrix processor), because it
will usually contain two non-identical memories for scalar and
array data. One of the design goals of the intermediate language
was to be relatively implementation-independent. In addition,
there was a strong desire to keep Parallel P-code at a high level
of abstraction to simplify the "front end" and retain as much
symbolic information as possible for the "back end" to use for
optimization and code generation. Therefore, all record offsets
in Parallel P-code are made by means of symbolic names. The
names correspond to the field names defined in .RECORD
statements.

The exact format of a record descriptor is not known to the
"front end". Instead, the record descriptor is constructed
with the aid of the "select" (SEL) instruction. A descriptor
that specifies the entire record is loaded onto the stack; this
is similar to the "blank" descriptor described above for arrays
but may be used without further modification to access the entire
record. The SEL operator is used to select a field from the
record. This replaces the record descriptor on top of the stack
with a modified descriptor that indicates the address of the
record and the selected field. If that field is itself a record,
another SEL is then used to select a field within that
sub-record.

The SEL operator, like the IX1 operator, changes the logical
type of its operand from a record to a record field. As with the
array case, the dynamic descriptor will still contain information
about the physical storage associated with the old logical type.

Descriptors for more complex structures (e.g. arrays of
records, arrays within records) are constructed by repeated
application of the techniques above. For instance, given the
following:

     arrec: array [1..5] of
       record
         x: array [1..10] of integer;
         y: integer;
       end;

a descriptor for "arrec[1..2].x[i..i+2]" would be constructed
by the following steps:

1. Load a "blank descriptor" (which specifies the address but
   no index ranges) of "arrec" onto the runtime stack.

2. Load the constant 0 onto the stack. Perform an IX2
   operation using the subrange "1..2". The stack now
   contains a descriptor for an array of records.

3. Perform a SEL to select the field "x" in the records
   described by the descriptor on the stack. The stack now
   contains a descriptor for a two-dimensional array, for which
   the first index range has been selected as "1..2" and the
   second is (as yet) unspecified.

4. Load the value of "i" onto the stack. Perform an IX2
   operation using the subrange "0..2". The stack now
   contains a descriptor for a 2x3 array whose dimensions have
   been selected as "1..2" and "i..(i+2)".
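
The four steps can be traced with a toy model of a dynamic
descriptor. This is a sketch only: the class and its methods are
hypothetical, the real descriptor format is left to the
implementation, and IX2 is assumed (from the steps above) to add
the value on the stack to both ends of the constant subrange.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptor:
    """Toy dynamic descriptor: a variable name plus the selections applied so far."""
    name: str
    path: tuple = ()

    def _extend(self, step):
        return Descriptor(self.name, self.path + (step,))

    def ix1(self, index):
        """Index by a scalar (reduces the logical rank)."""
        return self._extend(("index", index))

    def ix2(self, base, low, high):
        """Index by a subrange low..high offset by the stacked value base."""
        return self._extend(("range", base + low, base + high))

    def sel(self, field_name):
        """Select one field of a record."""
        return self._extend(("field", field_name))

    def shape(self):
        return tuple(s[2] - s[1] + 1 for s in self.path if s[0] == "range")

    def render(self):
        text = self.name
        for step in self.path:
            if step[0] == "range":
                text += f"[{step[1]}..{step[2]}]"
            elif step[0] == "index":
                text += f"[{step[1]}]"
            else:
                text += f".{step[1]}"
        return text


i = 4                                   # an arbitrary runtime value of "i"
d = Descriptor("arrec")                 # step 1: blank descriptor
d = d.ix2(0, 1, 2)                      # step 2: constant 0, subrange 1..2
d = d.sel("x")                          # step 3: select field x
d = d.ix2(i, 0, 2)                      # step 4: value of i, subrange 0..2
print(d.render(), d.shape())
```

Each operation returns a new descriptor built from the previous
one, which mirrors the recursive way in which descriptors are
formed on the runtime stack.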
4.2.7 Pointers

In standard Pascal, a pointer is simply the address of a
data item. The pointer can be copied and compared, but its value
cannot otherwise be affected by the programmer. Parallel Pascal
provides the same symbolism for specifying pointers that standard
Pascal does. However, the implementation of a pointer as simply
an address limits its usefulness in Parallel P-code.

As the previous section discussed, the dynamic portion of an
array descriptor, a record descriptor, or a hybrid of both,
consists of an address and information about which dimensions (or
fields) are selected. Once this information has been constructed
on the runtime stack, it can be used as an address for P-code
operations (for example, loads and stores). Normally, the
descriptor is used in the process of manipulating the data it
describes, but at times it is necessary for the descriptor itself
to be manipulated. (These operations are compiler-generated,
since Parallel Pascal does not provide the concept of a
descriptor.) For this reason, Parallel P-code implements all
pointers as descriptors. Descriptors of scalar data are simply
addresses; hence, for scalars the concept of a pointer is
unchanged.

Descriptors (pointers) are defined in Parallel P-code with
the ".POINT" pseudo-operation. For instance, to define type
"abc" as a pointer to type "xyz" the statement would be:

     .POINT abc,xyz

Occasionally, to expedite processing by the compiler front
end, it is convenient to refer to two identical types by
different names. Parallel P-code provides the ".TYPE"
pseudo-operation for this purpose. The statement

     .TYPE xxx,yyy

defines type "xxx" to be the same thing as the already-defined
type "yyy".


4.3 Memory Allocation

There are two types of variables in a Pascal program - those
which are allocated on the runtime stack and those which are
allocated dynamically from a runtime heap. The former correspond
to the ordinary variables declared by a subroutine (function or
procedure) - they are automatically created upon subroutine entry
and automatically deleted upon its exit. The latter correspond
to the pointer variables - the pointers themselves are allocated
upon entry to a subroutine but they reference memory which is
allocated by the procedure "new" and released by the procedure
"dispose". Like Pascal, Parallel Pascal uses both types of
memory allocation.

In standard P-code the local variables for each subroutine
are allocated on the runtime stack by reserving a consecutive
block of stack memory. A special instruction, ENT, specifies the
number of arguments to the subroutine and the number of memory
units required for local variables and temporary storage. The
compiler which produces the P-code "knows" the memory
requirements of each type of variable; thus, it can calculate the
offset within this consecutive block of each local variable
contained therein. In the case of an array, the elements of the
array are stored consecutively in row-major order; the compiler
can compute the address of any element according to the usual
formula.
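
For reference, the usual row-major computation can be written as
follows. The function name is hypothetical, and the result is an
element count; a compiler would multiply it by the element size
and add the base address of the array.

```python
def row_major_offset(indices, bounds):
    """Element offset of indices within an array whose dimensions are bounds.

    bounds is a list of (low, high) pairs, one per dimension; the last
    dimension varies fastest, as in row-major storage.
    """
    offset = 0
    for index, (low, high) in zip(indices, bounds):
        if not low <= index <= high:
            raise IndexError(f"{index} outside {low}..{high}")
        extent = high - low + 1
        offset = offset * extent + (index - low)
    return offset


# Element [3,4] of an array declared as array [1..5,2..6].
example_offset = row_major_offset((3, 4), [(1, 5), (2, 6)])
print(example_offset)
```

The calculation depends only on the declared bounds, which is
why it can be carried out entirely at compile time when the
indices are constants.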

A typical Pascal program will contain a main program and one
or more user-defined functions or procedures. Because Pascal is
a block-structured language, procedure and function definitions
are nested; that is, the definitions of some subroutines will be
contained within the main program, and some of these subroutines
will themselves contain the definition of other subroutines.
This is referred to as the static nesting of the program. Each
procedure is associated with a lexical level. The outermost
block contains the main program and the global variables; these
are located at lexical level 0. If a function or procedure
definition is contained inside a block at level i, then that
function or procedure is at level i+1. Functions and procedures
at level i can reference all of the variables and invoke all of
the procedures and functions defined in their containing blocks
(levels 0 through i-1).

Pascal permits recursive function and procedure calls. Each
time a function or procedure at level i is called a new set of
local variables is allocated. Thus, when a function or procedure
at level i accesses a variable at level j it is accessing the
variable corresponding to the most recent set of allocations at
level j. Unlike the static nesting, the sequence of memory
allocations, called the dynamic chain, will vary at runtime.

Corresponding with each called function or procedure is an
area on the runtime stack called the stack frame (or activation
record). In addition to the arguments to the function or
procedure, the local variables, and space for temporary results,
the stack frame includes some linkage information. In standard
P-code this includes the return address, space for a returned
function result (this field is unused for procedures), and two
locations for the static and dynamic links. The static and
dynamic links point to the appropriate previous stack frames.
The hypothetical machine which implements P-code contains a
non-user-accessible register called the "frame pointer" which
holds the address of the current stack frame.

Because of the dynamic nature of the memory allocation, it
is not possible to compute the absolute addresses of any data
(except for variables in the outermost - global - block).
Instead, the desired locations are obtained by using a two-level
lexical-level addressing scheme. The form of a lexical-level
address is

     (level, offset)

where "level" is the static nesting level and "offset" is the
offset of the variable relative to the beginning of the stack
frame which contains it. Standard P-code uses a modified version
of this scheme. Rather than specifying the lexical level
directly, it instead specifies the difference between the current
lexical level and the lexical level of the desired operand.
Thus, if the current lexical level is 4 and the desired variable
is at offset 43 at lexical level 1, the lexical address is
(1,43), which standard P-code expresses as (4-1,43) or (3,43).
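
The level-difference form is convenient because it is exactly
the number of static links that must be followed to reach the
frame holding the variable. The sketch below illustrates this
with a hypothetical Frame class and helper names; it is not the
P-machine's actual frame layout.

```python
class Frame:
    """Toy stack frame: a lexical level, a static link, and some storage."""

    def __init__(self, level, static_link=None):
        self.level = level
        self.static_link = static_link     # frame of the enclosing block
        self.slots = {}                    # offset -> value


def to_level_difference(current_level, level, offset):
    """Convert (level, offset) to the (difference, offset) form."""
    return (current_level - level, offset)


def resolve(frame, difference, offset):
    """Follow `difference` static links, then read the slot at `offset`."""
    for _ in range(difference):
        frame = frame.static_link
    return frame.slots[offset]


# Build a chain of frames for lexical levels 1 through 4.
frames = [Frame(1)]
for level in range(2, 5):
    frames.append(Frame(level, static_link=frames[-1]))
frames[0].slots[43] = "target"             # variable at level 1, offset 43

address = to_level_difference(4, 1, 43)
value = resolve(frames[-1], *address)
print(address, value)
```

An implementation that uses a display instead of a static link
chain would index the display by the absolute level and would
have no need for the difference.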

The use of lexical-level addressing is a powerful technique.
However, the allocation of memory directly on the runtime stack
presents problems for Parallel P-code. First, one of the goals
is that Parallel P-code be machine-independent. This precludes
the use of compiler-calculated offsets for variables within a
stack frame, since word sizes and data representations vary from
machine to machine. Second, a parallel matrix processor which
contains more than one type of memory (e.g. array memory and
scalar memory) cannot allocate all variables on one stack.
Therefore, Parallel P-code implements a modified form of lexical
addressing.

Parallel P-code represents lexical addresses directly,
rather than subtracting the lexical level of the operand from the
current lexical level. This definition is more intuitive and
constitutes no loss of information. Parallel P-code does not
define the exact format of a stack frame; specifically, it does
not define the format of the static and dynamic links. These are
left to the implementation. This provides a degree of
flexibility - an implementor may wish to use a display [6] rather
than an explicit static link chain.

To eliminate the need for compiler-generated offsets in a
lexical address, Parallel P-code uses a symbolic form of lexical
addressing. Each function or procedure argument and each local
variable is assigned an index number. The lexical address
consists of the lexical level and the index number. For
instance, given the function:

     function func(a,b: real) : integer;
     var x,y: integer;

the function result is index 0, "a" is index 1, "b" is index
2, "x" is index 3, and "y" is index 4. If this function is
at lexical level 5 the lexical address of "x" would be (5,3).
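
The numbering rule can be stated compactly in code. This is a
sketch with hypothetical names; the key "<result>" is simply a
placeholder label for index 0.

```python
def assign_indices(arguments, local_variables):
    """Index 0 is the function result; arguments, then locals, follow from 1."""
    indices = {"<result>": 0}
    names = list(arguments) + list(local_variables)
    for number, name in enumerate(names, start=1):
        indices[name] = number
    return indices


# function func(a,b: real) : integer;  var x,y: integer;
func_indices = assign_indices(["a", "b"], ["x", "y"])
x_address = (5, func_indices["x"])       # func is at lexical level 5
print(func_indices, x_address)
```

Because the address carries an index rather than a storage
offset, the code generator remains free to place each variable
in whichever memory, and at whatever size, suits the target
machine.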

Local variables, arguments to the function or procedure, and
the result (if the routine is a function) are specified in
Parallel P-code with the .ARG and .LOCAL pseudo operators.
Arguments and local variables share the same set of indices;
however, arguments require special treatment and therefore
warrant a separate declaration statement. The index 0 is
reserved for the result of a function. It is unused for
procedures. Arguments are defined with the syntax:

     .ARG index,type,rv

where "index" is the index number, "type" is a type name, and
"rv" is zero if the argument was passed by value or one if it
was passed by reference. Local variables are declared with a
similar statement:

     .LOCAL index,type,overlay

The "index" and "type" fields are identical to those for
.ARG. The "overlay" field is similar to the alignment
("offset") field of the .RECORD pseudo operator. It is normally
zero, indicating that the local variable should be allocated the
next available memory location (or locations). If it is
non-zero, it specifies a previously-defined local variable (at
the same lexical level); the new variable is to be overlaid on
the memory allocated for the specified old variable. The use of
this feature to implement the with statement is described below.
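
As an illustration of the two declaration forms, the following
sketch emits the statements for a routine with two value
arguments of type real and two integer locals. The helper name
is hypothetical, only the zero overlay case is shown, and the
declaration of the function result is omitted.

```python
def declare(arguments, local_types):
    """Emit .ARG lines, then .LOCAL lines, numbering from index 1.

    arguments is a list of (type_name, by_reference) pairs.
    """
    lines = []
    index = 1
    for type_name, by_reference in arguments:
        rv = 1 if by_reference else 0      # 0: by value, 1: by reference
        lines.append(f".ARG {index},{type_name},{rv}")
        index += 1
    for type_name in local_types:
        lines.append(f".LOCAL {index},{type_name},0")   # 0: no overlay
        index += 1
    return lines


declarations = declare([("real", False), ("real", False)],
                       ["integer", "integer"])
print("\n".join(declarations))
```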

Parallel P-code also defines explicitly the lexical level of
each procedure or function. Each routine is preceded by an
.ENTRY statement and followed by an .EXIT statement. These
specify the lexical level of the enclosed procedure. The .ENTRY
statement also specifies the processor on which the procedure or
function is to run:

     .ENTRY level,site
     .EXIT level

"level" is a lexical level number, and "site" is either the
literal string "HOST" indicating that the routine is to be
executed on the host machine, or "MCU" indicating that it is to
be executed on the (main) control unit of the parallel processor.

There are no Parallel P-code instructions which directly
allocate or release dynamic memory. These operations are
performed by the standard procedures "new" and "dispose".
These procedures operate in Parallel P-code the same way as in
standard P-code, except that they may return either a scalar
memory pointer or an array memory pointer. They accept as an
operand a pointer variable.


4.4 Data Manipulation 
4.4.1 Overall Strategy

As the previous sections have discussed, Parallel P-code,
like standard P-code, is a stack-oriented language. This section
describes the overall data manipulation scheme in Parallel
P-code. The specific opcodes provided in Parallel P-code are
described in full in Appendix B.

Conceptually, the runtime stack for Parallel P-code contains 
quantities which are either scalars or are descriptors for an 
array or record type. At times, the stack also contains pointers 
to scalars (one might consider these to be scalar descriptors). 

When an operation is performed on scalars, the address where
the result is to be stored is loaded onto the stack, the scalar
expression is calculated, and a "store indirect" is performed
to store the result of the expression (on top of the runtime
stack) at the specified address (the second item on the runtime
stack).

In order to generalize the scalar case to structured types
(arrays and records), it is necessary to define what is meant by
a "load" of an array or record. In Parallel P-code these are
always accessed through a descriptor. Since a descriptor is
essentially a generalized pointer, there is a fundamental
difference between manipulating scalars and manipulating
structured types. In the former case, the value of the scalar is
loaded onto the stack, manipulated, and stored. In the latter
case there is an additional level of indirection.

When an operation is performed, the result must be stored in
a temporary area and a descriptor for that temporary area placed
upon the runtime stack. Considered in this fashion, the
descriptors for the defined local variables are analogous to the
addresses of scalar variables, and the descriptors of temporaries
are analogous to scalar values on the stack. The automatic
allocation of the temporary storage to which the descriptors
refer is the responsibility of the implementation.

4.4.2 Load Instructions

Parallel P-code provides five instructions for loading data
onto the runtime stack.

The simplest instruction is LDC, which loads a constant.
The constant is never an array or record. The constant may be an
integer, floating-point number, character, Boolean value, or set.
The specified constant is pushed onto the top of the runtime
stack.

Two instructions are provided for loading addresses. The
first, LCA, is used to load the address of a constant. The
constant is specified as in the LDC instruction. Rather than
loading the value of the constant, however, LCA creates a
constant in memory and loads its address onto the stack. This
instruction is used when it is necessary to pass a constant
string "by reference" to a procedure or function. The second
instruction, LLA, converts a lexical address (level,index) to an
absolute address and pushes the address on the runtime stack. If
the item is an array, an array descriptor which specifies the
location of the array but no indexing information is pushed;
similarly, if the item is a record, a descriptor which specifies
the entire record will be pushed. (Hybrids of arrays and records
are handled in the same fashion, as discussed in section 4.2.6.)

Data is loaded by means of the LOD and LDI instructions.

LOD is used to load scalar values whose (lexical) address is
known at compile time. LDI deserves detailed attention.

The syntax for LDI is:

    LDI type

where "type" is the type of the data to be loaded. The top of
the runtime stack contains a descriptor for the data to be
loaded. If "type" is a non-array, non-record type, this
descriptor is simply an address. In this case, the specified
data is loaded onto the stack. If "type" is an array or record
type, the addressed data is copied to a temporary location (in
either the array memory or the scalar memory) and the top of the
runtime stack is replaced with a descriptor to the temporary
location.

Usually, an LDI of an array or a record is redundant because
the descriptor will only be used as the input to a subsequent
operation (e.g. ADD). However, in some cases it is necessary to
preserve the distinction between the variable itself and its
value. For instance, if a variable is passed "by value" to a
function, it is not acceptable to pass the original array
descriptor; instead, a descriptor for a copy of the array must be
passed. Where this distinction is not necessary, the
implementation may choose to ignore the LDI (e.g. a simple
optimization would be to omit any LDI whose result is used in a
subsequent expression).

When an LDI is performed on a file, the file buffer variable
is loaded onto the stack.

4.4.3 Store Instructions

Parallel P-code provides only two store instructions, STR
and STO. STR is used to store non-array, non-record data at a
(lexical) address which is known at compile time. STO is used to
store all forms of data (on top of the runtime stack) into the
variable specified by the descriptor which is next to the top of
stack. In the case of a non-array, non-record, this descriptor
is simply an address. In this case, the top of stack is copied
into that address. Otherwise, the data indicated by the
descriptor on top of the stack is copied into the area indicated
by the descriptor next to the top of the stack.

When an STO is performed using a file as the address, the
top of stack is stored in the file buffer variable for the
specified file.

4.4.4 Type Conversions

There are two mechanisms by which the type of an item on the
runtime stack may be altered. It may be explicitly coerced by
the CVT or CVN operator, or, if it is a record type, a field may
be selected with the SEL operator.

The CVT and CVN operators are used to perform a variety of
type conversions. They are essentially the same operator, except
that CVT operates upon the top of the runtime stack, while CVN
operates upon the next-to-top of the runtime stack. The syntax
for these operators is

    CVT oldtype,newtype

    CVN oldtype,newtype,tstype

where "oldtype" is the old type of the item to be converted,
"newtype" is the type it is to be converted to, and "tstype"
(for CVN) is the type of the top of stack (this information is
needed because stack items may be different sizes).

CVT and CVN perform three major functions. First, they
convert scalars or arrays of a simple type from one base type to
another. An example of this is the conversion of an array of
integers to an array (with the same shape and array indices) of
real numbers. Second, they collapse one-element arrays into
scalars. For this case, the array descriptor specifies a single
element. Third, they expand scalars into arrays. In this case,
every element of the resulting array has the value of the scalar.
The role that these scalar-to-array conversions play is discussed
in more detail below.
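
The three roles of CVT and CVN can be illustrated with a small
Python sketch. It is not part of Parallel P-code: the function
names "convert_base", "collapse" and "expand", and the use of
plain Python lists to stand for arrays, are assumptions made only
for illustration.

```python
# Hypothetical model of the three CVT/CVN conversions; arrays are plain lists.

def convert_base(values, new_base):
    """Convert every element to a new base type, keeping the shape."""
    return [new_base(v) for v in values]

def collapse(values):
    """Collapse a one-element array into a scalar."""
    if len(values) != 1:
        raise ValueError("only a one-element array can be collapsed")
    return values[0]

def expand(scalar, length):
    """Expand a scalar into an array whose elements all equal the scalar."""
    return [scalar] * length

reals = convert_base([1, 2, 3], float)   # integer array to real array
one = collapse([7])                      # one-element array to scalar
ones = expand(1, 5)                      # scalar to array
```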

SEL is used to select a field from a record. The syntax is:

    SEL rectype,field,newtype

where "rectype" is the type of the record, "field" is the
name of the field to be selected, and "newtype" is the type of
the result. "newtype" is not necessarily the type of "field"
- if "rectype" is an array of records then "newtype" will be
an array also. The record descriptor on top of the runtime stack
is modified to include the additional field selection
information.

4.4.5 Conformability

Parallel Pascal has strict rules regarding the
conformability of two items which are used together. The
conformability rules ensure that the specified operation is
well-defined and efficiently implementable. Operands in Parallel
P-code are also required to be conformable, although the
requirements are less rigid than those in Parallel Pascal. An
operation which is performed on two non-conformable items is an
error. The disposition of this error condition is left to the
implementation.

In Parallel P-code, the operands of a binary operation (this
class includes the "store" instructions) must be conformable in
two ways. First, the base types of the operands must be
identical. For instance, it is illegal to combine an integer (or
an array of integers) with a real number (or an array of real
numbers) without first explicitly converting one of the operands
so that both are integers or both are real numbers. This
conversion is performed by the CVT and CVN operators. It is also
illegal to combine an integer and a value of subrange type
without first explicitly converting one operand so that the types
match.

In Parallel Pascal, arrays may be combined with scalars, and
two arrays of the same shape may be used together. (More
precisely, two arrays whose non-scalar index ranges are identical
or explicitly specified, and which have the same shape, may be
used together.) In Parallel P-code, the operands to an
instruction must always be the same type. If a scalar is to be
combined with an array, the scalar must first be expanded to an
array of the same shape. This expansion creates an array
descriptor with "blank" indexing information. For instance,
consider the array defined by

    var
      a: array [1..5] of integer;

This might be defined in Parallel P-code as:

    .ARRAY T8,integer,1,1,5
    .LOCAL 1,T8,0

To increment all elements of "a" by one, the following sequence
could be used:

    LLA 0,1         ;array descriptor for "a"
    IX0 T8
    LLA 0,1         ;array descriptor for "a"
    IX0 T8
    LDI T8          ;load "a"
    LDC integer,1
    CVT integer,T8  ;convert scalar to array
    IX0 T8          ;define index range for new array
    ADD T8
    STO T8


The LDC places the integer constant 1 onto the top of the stack.
The CVT expands the top of the stack into an array of type
"T8", every element of which contains the value 1. (The
resulting array is allocated in temporary memory and its
descriptor replaces the integer on top of the stack.) The top of
stack is an array descriptor with "blank" indexing information,
so the IX0 is used to select every element of the (newly-created)
array. The ADD then adds together the two arrays whose
descriptors are on top of the stack.
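
The interplay between a "blank" descriptor and a later index
selection can be sketched in Python. This is a loose model under
stated assumptions, not the Parallel P-code runtime: the class
"Descriptor" and the helpers "cvt_expand", "select", "load" and
"store" are hypothetical names, and a descriptor is reduced to a
storage list, a lowest index, and an optional selected range.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Descriptor:
    """Hypothetical descriptor: storage, lowest index, selected index range."""
    data: List[int]
    low: int
    selected: Optional[Tuple[int, int]] = None  # None models "blank" indexing

def cvt_expand(scalar: int, low: int, high: int) -> Descriptor:
    """Model the scalar-to-array CVT: fill a temporary, leave indexing blank."""
    return Descriptor([scalar] * (high - low + 1), low)

def select(desc: Descriptor, lo: int, hi: int) -> Descriptor:
    """Fill in the index range of a descriptor (the storage is shared)."""
    return Descriptor(desc.data, desc.low, (lo, hi))

def load(desc: Descriptor) -> List[int]:
    """Copy out the selected elements; a blank descriptor cannot be loaded."""
    if desc.selected is None:
        raise ValueError("indexing information is blank")
    lo, hi = desc.selected
    return desc.data[lo - desc.low : hi - desc.low + 1]

def store(dest: Descriptor, values: List[int]) -> None:
    """Write values into the elements selected by dest."""
    lo, hi = dest.selected
    dest.data[lo - dest.low : hi - dest.low + 1] = values

a = Descriptor([3, 1, 4, 1, 5], low=1)        # models a: array [1..5]
dest = select(a, 1, 5)                        # destination descriptor
ones = select(cvt_expand(1, 1, 5), 1, 5)      # expanded constant, then indexed
total = [x + y for x, y in zip(load(select(a, 1, 5)), load(ones))]
store(dest, total)                            # every element of a goes up by one
```

Because the expanded constant carries no index range of its own,
the same temporary could equally be indexed to match a subset of
the array.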


The creation of a "blank" array descriptor by the CVT
instruction allows a scalar to interact with any subset of an
array. In the previous example the entire array was selected;
however, a subset can also be easily incremented. The operation

    a[@1..2] := a[@1..2] + 1;

would be implemented by




    LLA 0,1
    LDC integer,1
    LDC integer,2
    IX2 T8          ;array descriptor for "a[@1..2]"
    LLA 0,1
    LDC integer,1
    LDC integer,2
    IX2 T8          ;array descriptor for "a[@1..2]"
    LDC integer,1
    CVT integer,T8  ;blank descriptor for constant array
    LDC integer,1
    LDC integer,2
    IX2 T8          ;select subrange
    LDI T8          ;load "a[@1..2]"
    ADD T8
    STO T8

Because the operands to all Parallel P-code instructions
must be the same type, when two array segments (that is, arrays
or subsets of arrays) with the same shape but different types are
combined, one must be converted to conform to the other one. For
example, given the Parallel Pascal statements:

    var
      a: array [1..5] of integer;
      b: array [5..9] of integer;

    a := a + b[@5..9];

the Parallel P-code definitions might be:

    .ARRAY T5,integer,1,1,5
    .ARRAY T6,integer,1,5,9
    .LOCAL 1,T5,0
    .LOCAL 2,T6,0

The two arrays would be loaded onto the stack by constructing
their array descriptors and performing a LDI:


    LLA 0,1         ;array descriptor for "a"
    IX0 T5
    LLA 0,1         ;array descriptor for "a"
    IX0 T5
    LDI T5          ;load "a"
    LLA 0,2         ;array descriptor for "b"
    IX0 T6
    LDI T6          ;load "b"



However, before these two arrays can be added (and the result
stored), it is necessary to convert them to the same type. For
example, the top-of-stack can be converted from type T6 to T5:

    CVT T6,T5

The array is converted in temporary memory and the array
descriptor for the result replaces the array descriptor on top of
the stack. (Note that the resulting descriptor is not "blank"
- that is, the index ranges are filled in by CVT. "Blank"
indexing information only results when a scalar is expanded to an
array.)


In some cases, such as the example above, the index range of
the array to be converted does not fall within the index range of
the type it is converted to. In these cases, the implementation
of CVT must adjust the index ranges so that they fall within the
dimensions of the result type. (If a dimension of the result
type is not large enough to contain a dimension of the operand an
error has occurred, for in this case the two array operands
cannot possibly have the same shape.) After the operands have been
converted to identical types, the arrays may be added and the
result stored:

    ADD T5
    STO T5


The two stack operands will have the same type but the
index ranges of the two operands will be different. The
implementation must "shift" one of the operands so that the
active array elements "line up". The choice of which operand
to shift is left to the implementation except in the case of a
STO instruction; in this case the data being stored must be
aligned with the subset of the array it is to be stored into.
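
The "shift" can be pictured with a short Python sketch. It is
only an illustration under assumed names: "Segment" (a
hypothetical stand-in for a descriptor whose index range is filled
in) and "add_aligned" are not part of Parallel P-code. Elements
are paired by position within each segment rather than by index
value, and the sketch arbitrarily gives the result the index range
of the left operand.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    """Hypothetical 1-D array segment: its lowest index plus its values."""
    low: int
    values: List[int]

    @property
    def high(self) -> int:
        return self.low + len(self.values) - 1

def add_aligned(left: Segment, right: Segment) -> Segment:
    """Add two segments of the same shape, shifting `right` onto `left`."""
    if len(left.values) != len(right.values):
        raise ValueError("operands do not have the same shape")
    shift = left.low - right.low  # how far `right` must move to line up
    summed = [x + y for x, y in zip(left.values, right.values)]
    return Segment(right.low + shift, summed)

a = Segment(1, [10, 20, 30, 40, 50])   # models a[1..5]
b = Segment(5, [1, 2, 3, 4, 5])        # models b[5..9]
result = add_aligned(a, b)             # occupies indices 1..5
```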


Finally, if scalar indexing is used the logical shape of the
array is altered. Thus, given the definition:

    var
      a: array [1..5,1..5] of integer;

which might be defined in Parallel P-code by

    .ARRAY T8,integer,2,1,5,1,5
    .LOCAL 1,T8,0

then the statement

    a[1,] := a[,1];

would be translated into Parallel P-code as

    LLA 0,1
    LDC integer,1
    .ARRAY T9,integer,1,1,5
    IX1 T8,T9       ;index by scalar -- note type conversion
    IX0 T9          ;select entire dimension
    LLA 0,1
    IX0 T8          ;select entire dimension
    LDC integer,1
    IX1 T8,T9       ;index by scalar -- note type conversion
    LDI T9          ;load column
    STO T9          ;store in row

This illustrates the difference between the static (logical) type
and the dynamic (physical) information which is contained on the
run-time stack. Clearly the data allocations for the two "T9"
types are non-identical; however, their logical types are
identical and hence the two arrays are conformable.


The SEL instruction, like the IX1 instruction, provides
additional information to the dynamic descriptor (physical
specification) and alters the static descriptor (logical type).
For example, given the definition:

    var
      a: array [1..5,1..5] of integer;
      r: array [1..5] of
        record
          x: array [1..5] of integer;
          y: real;
        end;

which might be represented in Parallel P-code as

    .ARRAY  T8,integer,2,1,5,1,5
    .ARRAY  T9,integer,1,1,5
    .RECORD T10,x,nil,T9
    .RECORD T10,y,nil,real
    .ARRAY  T11,T10,1,1,5

    .LOCAL 1,T8,0
    .LOCAL 2,T11,0

then the statement

    a := r.x;

would be translated into Parallel P-code as

    LLA 0,1         ;"blank" descriptor for "a"
    IX0 T8
    IX0 T8          ;descriptor for entire array "a"
    LLA 0,2
    IX0 T11         ;descriptor for array of entire records
    SEL T11,x,T8
    IX0 T8          ;descriptor for entire array "x" in "r"


The physical storage corresponding to all elements of the array
"a" is clearly different than the storage corresponding to all
elements of the field "x" in the array of records "r". However,
their logical types are identical and hence the two items are
conformable.


4.5 Standard Functions and Procedures

Parallel Pascal, like standard Pascal, provides a set of
standard functions and standard procedures to perform tasks which
are difficult or impossible to specify directly. Standard
functions and procedures can in some sense be considered as
extended operators, for the types of (and often even the number
of) their parameters may vary.

In P-code, the arguments to a standard function or procedure
are loaded onto the stack and the routine is called. At the
P-code level all standard procedure calls have a fixed number of
arguments and a fixed type. When a Pascal procedure such as
"write" has multiple arguments of different types it is
implemented as a series of calls to fixed-format routines such as
"wri" (write integer), "wrr" (write real), etc.

In Parallel Pascal some standard functions may be called
with a variable number of arguments which cannot be serialized
into a set of fixed-format calls. An example is the "shift"
function, for which the number of dimensions of the first
argument (the array to be shifted) determines the number of
parameters which are passed to the function.

To deal with the variable number of arguments and the
varying types of the arguments (since arrays of any shape may be
operated upon) Parallel P-code uses a modified calling sequence.
First, the stack is marked with the MST instruction. Standard
functions and procedures are considered to be at lexical level
zero; no other routines are (the outermost block - the program
block - is at lexical level 1). The arguments are then computed.
Scalars are treated as in standard P-code: if passed "by value"
a scalar expression is evaluated; otherwise, the address of the
scalar is passed. Arrays and records are always passed as
descriptors. If they are passed "by value" an LDI is
performed, so that the descriptor points to a temporary-storage
copy of the data, rather than to the original variable. If they
are passed "by reference" the original descriptor is passed.
Finally, the CSP instruction is used to call the standard
procedure or function. If the called routine is a function it is
responsible for storing its result on top of the runtime stack
when it exits; in all cases the return from the called routine
will reset the stack back to the marked location.

Standard procedures and functions operate principally upon
an item of a particular data type; the other arguments have fixed
data types. For instance, the "shift" routine operates upon an
array whose type and shape may vary; the additional operands are
all integers. Parallel P-code provides a mechanism for
specifying the logical data type of this primary argument as well
as the result type of the function. Thus, the format of the CSP
instruction is:

    CSP stdfunc,argtype,restype

where "stdfunc" is the name of the standard function or
procedure, "argtype" is the (logical) data type of the primary
argument, and "restype" is the (logical) data type of the
function result. (If the called routine is a standard procedure
the literal string "nil" will be used.)


4.6 User-defined Functions and Procedures

Unlike the standard functions, a user-defined function or
procedure is always called with a fixed number of arguments whose
types are constant. This leads to a regular structure for
calling user-defined routines.

As discussed earlier, standard P-code and Parallel P-code
both organize memory allocation on the run time stack into stack
frames. Each stack frame contains the static and dynamic links,
the return address, the arguments to the routine, and the local
variables. (In the case of Parallel P-code these variables are
symbolically specified.)

In standard P-code, a subroutine or function call involves
several steps. First, the stack is "marked". This sets up a
new stack frame and fills in the static and dynamic links. Next,
each argument to the routine is generated. For arguments passed
"by value" this involves the evaluation of the argument as an
expression on the stack; for arguments passed "by reference"
this consists of loading the address of the argument onto the
stack. (Standard P-code has no provision for passing procedure
or function parameters; these are therefore not implemented in
the P4 compiler.) After all of the arguments have been prepared
the function is called. When the function returns, the stack is
reset back to the marked location. If the called routine was a
function, the function result is left on top of the runtime
stack.

The calling sequence in Parallel P-code is very similar to
the standard P-code case. The stack is marked with the MST
instruction, which specifies the lexical level of the procedure
or function to be called. The arguments are then computed.
Scalars are treated as in standard P-code: if passed "by value"
a scalar expression is evaluated; otherwise, the address of the
scalar is passed. Arrays and records are always passed as array
descriptors. If they are passed "by value" a LDI is performed,
so that the descriptor points to a temporary-storage copy of the
array or record, rather than to the original variable. If they
are passed "by reference" the original descriptor is passed.
Parallel P-code, like standard P-code, does not provide for
passing functions or procedures as parameters.

The function or procedure is called with the CUP
instruction, which has the syntax:

    CUP level,routinename,resulttype

where "level" is the lexical level of the called routine,
"routinename" is its name, and "resulttype" is the data type
of the function result. (If the called routine is a procedure
this will be the literal string "nil".)

The function return is performed by the RET instruction.
This instruction has the syntax:

    RET type

where "type" is the type of the function result. In the case
of procedures, "type" will be the literal string "nil".
Local variable 0 is used by functions to hold the function
result. When the RET instruction is executed, the stack is
"popped" back to the caller, and the function result (if the
called routine was a function) is left on top of the runtime
stack. If the result was an array or record the top-of-stack
will be an array descriptor.


4.7 Conditional Execution

Parallel Pascal, like standard Pascal, provides a set of
control flow constructs. In standard P-code, these are
implemented with four "jump" instructions: XJP, FJP, UJP, and
UJC. The implementation of the case, for, while, and
repeat-until statements in Parallel P-code is identical to the
implementation in standard P-code. The XJP and FJP instructions
use the top of the runtime stack as an operand - as an index into
a table of addresses (XJP), or as a Boolean condition (FJP, or
"jump if false"), and in both of these cases the quantity on
the stack must be a scalar.

Parallel Pascal also provides the where statement to specify
conditional assignment of expressions to arrays, controlled by an
array expression. The where statement cannot be (efficiently)
implemented with the scalar-oriented control mechanisms of
standard P-code.

Parallel Pascal specifies that array assignments are
conditional within the body of a where statement. Thus, the
implementation in Parallel P-code will only affect the STO
operator (the STR operator is never used to store array data).

The MPP and most SIMD-class parallel processors associate
with each processor a flag known as the "mask bit" or
"activity bit". This bit controls whether the processor is
enabled or disabled. The collection of mask bits, one for each
processor, can be considered to be a "mask array". This array
can further be considered to be a Boolean array, since the values
it contains are binary. The controlling expression of a where
statement in Parallel Pascal is a Boolean array; hence, it is
natural to implement the where statement by using this array as
a mask array.

Parallel Pascal permits where statements to be nested. All
of the mask arrays in such a nested collection of statements must
have the same shape. Thus, there is a need for a nested sequence
of conditional expressions.

The current conditional status of a set of N nested
conditionals can be determined by using a stack of length N
(bits). If the current conditional state is $A$, and a where
statement is encountered which evaluates to $B$, the new
conditional state is $AB$ (the Boolean product of $A$ and $B$).
At some later point, if an otherwise is encountered, the desired
conditional state is $A\bar{B}$. Taking $+$ as the symbol for a
Boolean "inclusive or" and $\oplus$ as the symbol for a Boolean
"exclusive-or", and recalling:

    $x \oplus y = x\bar{y} + \bar{x}y$

    $\overline{xy} = \bar{x} + \bar{y}$

then

    $AB \oplus A = (AB)\bar{A} + \overline{(AB)}A = A\bar{A}B + (\bar{A}+\bar{B})A$
    $= (A\bar{A})B + \bar{A}A + A\bar{B} = A\bar{B}$

Using this result, the stack-oriented implementation can be
defined.

Initially the stack is empty, and all processors are
enabled. When a where conditional is encountered, a Boolean
"and" is performed with the current top of the stack (if the
stack is non-empty) and the result is pushed onto the stack.
When an otherwise is encountered, a Boolean "exclusive or" is
computed between the top two elements of the stack and the result
replaces the top of the stack. (If the stack contains only one
item, a Boolean "not" - complement - is performed.) When the
end of the conditional is encountered, the stack is popped.
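
The algorithm just described can be sketched in Python. This is a
minimal model under stated assumptions, not the MPP
implementation: the class name "MaskStack" and its method names
are hypothetical, and each mask array is modeled as a list of
Booleans with one entry per processor.

```python
from typing import List

class MaskStack:
    """Hypothetical model of the nested where/otherwise mask stack."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.stack: List[List[bool]] = []

    def active(self) -> List[bool]:
        """Current mask; every processor is enabled when the stack is empty."""
        return self.stack[-1] if self.stack else [True] * self.size

    def where(self, cond: List[bool]) -> None:
        """Push the condition ANDed with the current mask."""
        current = self.active()
        self.stack.append([c and m for c, m in zip(cond, current)])

    def otherwise(self) -> None:
        """Replace the top with top XOR next-to-top (NOT if only one item)."""
        top = self.stack[-1]
        if len(self.stack) == 1:
            self.stack[-1] = [not t for t in top]
        else:
            below = self.stack[-2]
            self.stack[-1] = [t != b for t, b in zip(top, below)]

    def end_where(self) -> None:
        """Pop the mask at the end of the conditional."""
        self.stack.pop()

masks = MaskStack(4)
A = [True, True, False, False]
B = [True, False, True, False]
masks.where(A)
masks.where(B)
inner_then = masks.active()   # A and B
masks.otherwise()
inner_else = masks.active()   # A and not B
masks.end_where()
masks.otherwise()
outer_else = masks.active()   # not A
masks.end_where()
```

The exclusive-or in "otherwise" needs only the two topmost masks,
so the original condition B does not have to be kept separately.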

The implementation of nested conditional masks in Parallel
P-code is based upon this algorithm. A new conditional
expression is pushed onto the mask stack by the WHR instruction,
which has the syntax:

    WHR type

The sense of the most recent conditional is reversed with the OTW
instruction, which has the syntax:

    OTW type

and the mask stack is "popped" by the ENW instruction, which
has the syntax:

    ENW type

When a new level of masking is entered, the mask expression
is computed on the runtime stack, and then the WHR instruction is
used to put the mask into effect. If the implementation so
desires, the WHR instruction may optionally not pop this
expression off of the run time stack, provided that the ENW
instruction does. This retains the temporary memory which holds
the mask expression. If a set of nested conditionals is
evaluated, there will be a set of temporary mask array
descriptors on the run time stack. The implementation may then
use the storage to which these refer to implement the mask stack.
(This method requires a small amount of temporary memory
allocated outside of the run time stack which holds pointers to
all of the descriptors on the stack.)

Parallel Pascal specifies that the effect of a mask is not
transmitted to a called function or procedure. Thus, each
procedure or function has its own mask stack. The implementation
may choose to include the information about the mask stack for
each routine in the stack frame (e.g. along with the static and
dynamic links, etc.).


4.8 The "with" Statement

Both standard Pascal and Parallel Pascal provide the with
statement to reduce the need to fully specify record accesses.
For instance, the following sequence of code:

    recp^.subrec.x := 0;
    recp^.subrec.y := 0;
    recp^.subrec.z := 0;

could be written as

    with recp^.subrec do
    begin
      x := 0;
      y := 0;
      z := 0;
    end;

The use of a with statement has two principal advantages.
The first advantage is that the program notation is simplified by
eliminating repetitions of the same record specification. (This
is not always an advantage, however, in programs with many record
types, because at times it becomes difficult for a human to keep
track of the record to which each component refers.) The second
advantage is that the address of the record to which the fields
belong is calculated only once, rather than each time a field is
referenced. For instance, in the first example above (without
the with statement) the pointer "recp" is de-referenced three
times. When the with statement is used (the second example
above), the address computation is performed only once.

To determine the implementation in Parallel P-code, it is
useful to first consider the implementation in standard P-code.
Two types of records are used as operands to a with statement:
those which are "normal" variables and those which are
specified by a record pointer.

In the case of non-pointer variables, the (lexical) address
of the specified record is known at compile time. Whenever a
reference is made to a field in that record the compiler adjusts
this address by the offset of the specified field and loads that
address on the stack. Thus, the (lexical) address of every field
specified in the with statement is known at compile time if the
argument to the with is an ordinary variable.


When a pointer is used in a with statement, the value of the
pointer must be preserved so that references to fields in the
record indicated by that pointer can be properly addressed. In
this case, not even the address of the pointer itself is
necessarily known at compile time. For instance, given the
following code segment:

    type
      rec = record
        x: integer;
        y: real;
      end;
      ptr = ^rec;
      pptr = ^ptr;
    var
      pp: pptr;
    begin
      new(pp);    (* "pp" points to an object of type "ptr" *)
      new(pp^);   (* "pp^" points to an object of type "rec" *)
      with pp^^ do
        x := 0;
      ...

the address of the pointer itself is clearly not known at
compile time. Hence, it is necessary to compute the value of
the pointer and save it in a temporary stack location. The P4
compiler "knows" the size of each stack element and can compute
the address of the temporaries. When a field within an
applicable record is referenced, the pointer is loaded onto the
top of the stack and the offset (in this case, the offset of
"x" relative to the start of the record) is added. This
produces the effective address of the desired operand; it may
then be used as the address for a LDI or STO.

In Parallel P-code, the situation is somewhat different
because the "front end" of the compiler is unaware of the
layout of records. A compile-time-constant (lexical) address for
an "ordinary" record (that is, one which is not accessed
through a pointer) cannot be calculated. (Actually, the
compile-time address is known if the record in question is not
itself a component of another record.) In addition, the
expression may be an array of records rather than a single
record. For this reason, the Parallel Pascal compiler always
calculates the value of the address expression in a with
statement. The descriptor for the record is loaded (either by
construction from the compile-time-known lexical address and a
sequence of SEL instructions, or - in the case of a pointer - by
loading the "pointer"). This descriptor is then stored in a
temporary variable. When a record field is used, the descriptor
is loaded from the temporary variable, the necessary SEL is
performed to access the desired field, and the resulting
descriptor is used.

Unlike standard P-code, which stores only the address of the
record, Parallel P-code must store the entire descriptor.
(Recall that descriptors in Parallel P-code play the same role as
pointers in standard P-code.) This presents a problem, because
several different descriptors may be required at different times
during the execution of a routine. It is desirable to overlay
the space which is allocated to them as much as possible. The
need for this sharing of temporary storage contributed to the
syntax of the .LOCAL pseudo operator, described above.

Briefly, the syntax of the .LOCAL pseudo operator is

    .LOCAL index, type, overlay

where ''index'' is an index that symbolically identifies the
location in the current procedure activation, ''type'' is the
data type, and ''overlay'' is used for memory sharing. The
''index'' and ''type'' fields are described in more detail above;
the ''overlay'' field is of interest here.

Local variables are allocated according to the following
rule: if ''overlay'' is zero, allocate the new variable beginning
at the next available location; otherwise, allocate the new
variable beginning at the same location as the variable whose
index is ''overlay''. Thus, the following statements:

    .LOCAL 1, integer, 0
    .LOCAL 2, real, 0
    .LOCAL 3, integer, 2

cause local variables 2 and 3 to be allocated in the next
available memory after local variable 1. If possible, variables
2 and 3 are to share the same memory. (The implementation is
free to ignore this memory sharing specification. This may be
necessary if the variables would reside in different memories -
something which is not the case for descriptors. However,
variables which are defined to use disjoint memory must indeed be
allocated that way; otherwise unexpected memory sharing will
result.)
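
The allocation rule for .LOCAL can be sketched in a few lines of
Python. This is an illustration only, not the code generator's
actual algorithm: the function name ''allocate_locals'', the size
table, and the policy of advancing the free pointer past the larger
of two overlaid variables are all assumptions made for the example.

```python
def allocate_locals(declarations, sizes):
    """Assign starting offsets to a sequence of .LOCAL declarations.

    declarations: list of (index, type_name, overlay) tuples, in order.
    sizes:        dict mapping type_name -> size in storage units
                  (hypothetical values supplied by the caller).
    Returns a dict mapping index -> starting offset.
    """
    offsets = {}
    next_free = 0
    for index, type_name, overlay in declarations:
        size = sizes[type_name]
        if overlay == 0:
            # overlay = 0: begin at the next available location
            start = next_free
        else:
            # otherwise: begin where the variable named by overlay begins
            start = offsets[overlay]
        offsets[index] = start
        # Assumption: later variables must clear the largest overlaid one.
        next_free = max(next_free, start + size)
    return offsets


if __name__ == "__main__":
    decls = [(1, "integer", 0), (2, "real", 0), (3, "integer", 2)]
    print(allocate_locals(decls, {"integer": 1, "real": 2}))
```

With the three statements from the text, variables 2 and 3 receive
the same starting offset, immediately after variable 1.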
5. MEMORY MANAGEMENT


5.1 The Memory Problem

The language Parallel Pascal was defined in section 2, and
an implementation using Parallel P-code was described in section
3. The language definition for Parallel Pascal places no limits
upon the utilization of memory by a program. Similarly, Parallel
P-code has no inherent restrictions on the size or shape of
arrays, whether they are parallel or not.

Although the specifications of the high-level and 
intermediate-level languages contain no size restrictions, the 
first implementations of Parallel Pascal almost certainly will. 
The implementation of the language is affected by two fundamental 
hardware constraints: the size of the processor array and the
amount of memory in each processing element's local memory. In
this section, these problems will be considered relative to the
implementation of Parallel Pascal on the Massively Parallel
Processor [1].

The first restriction arises from the fact that the MPP has 
a 128x128 element processor array. If data arrays are declared 
with these dimensions there is no problem; however, arrays with
much larger or much smaller dimensions require special
consideration. The problem of assigning memory locations to
variables is examined in section 5.2.

The second implementation restriction results from the very
small local memory in each processing element. Each PE has only
1024 bits of random-access memory. Although this amounts to two
megabytes of memory for the whole array, it is limited when
manipulating large amounts of data - each local memory can
store only 32 (32-bit) floating-point numbers. The main memory
is supplemented by two levels of secondary memory: a high-speed
random-access memory called the ''staging buffer'', and a
secondary input-output system (connection to the host computer or
a proposed parallel disc system). The staging buffer in the
initial delivered version of the MPP will have a two megabyte
capacity; plans are underway to eventually expand to sixteen
megabytes, out of a total capacity of 64 megabytes. Section 5.3
examines the efficient management of memory with this
configuration.
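
The capacity figures quoted above follow directly from the array
size and the per-PE memory. The short Python check below is only a
worked example; the constant and function names are hypothetical.

```python
PE_ROWS, PE_COLS = 128, 128   # processor array size given in the text
BITS_PER_PE = 1024            # local memory per processing element


def main_memory_bytes():
    """Total PE-array memory in bytes."""
    return PE_ROWS * PE_COLS * BITS_PER_PE // 8


def values_per_pe(bits_per_value):
    """How many values of a given width fit in one PE's local memory."""
    return BITS_PER_PE // bits_per_value


if __name__ == "__main__":
    print(main_memory_bytes() // (1024 * 1024), "megabytes in total")
    print(values_per_pe(32), "32-bit values per PE")
```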
5.2.1 Introduction

In this section, the allocation of storage on the PE array
is considered. To facilitate discussion, the following
definitions are assumed:

    NROW = number of rows in the processor array
    NCOL = number of columns in the processor array

An array whose last two dimensions have sizes NROW and NCOL,
respectively, can be easily mapped into the parallel array
memories - if the array is declared as:

    var
        arr: parallel array [I..J,K..L] of something;

where NROW=J-I+1 and NCOL=L-K+1, then ''arr[i,k]'' will be stored
in row i-I, column k-K.

This storage mapping can be extended to multi-dimensional
arrays whose last two index ranges have sizes NROW and NCOL,
respectively. The address within an array memory is computed
according to the usual formula. As an example, given the
definition:

    var
        arr: parallel array [9..12, 4..5, 1..128, 8..135] of integer;

and assuming that the base address within the PE memories for
''arr'' is a_0 and that integers are stored in 16 bitplanes, then
''arr[i,j,m,n]'' will be stored in row m-1, column n-8, at
address

    a_0 + 16 x [(i-9) x 2 + (j-4)]
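
As a hedged illustration of ''the usual formula'', the Python sketch
below assumes row-major ordering over the leading (non-parallel)
dimensions, with the last two dimensions selecting the PE row and
column. The function name ''pe_location'' and its parameters are
hypothetical and are not part of Parallel Pascal or Parallel P-code.

```python
def pe_location(index, bounds, base=0, bitplanes=16):
    """Map an element index to (PE row, PE column, bit-plane address).

    index:     tuple of subscripts, e.g. (i, j, m, n)
    bounds:    list of (low, high) pairs, one per dimension; the last
               two dimensions are assumed to match the PE array size
    base:      base address of the array within the PE memories
    bitplanes: number of bit planes occupied by one element
    """
    *lead_idx, m, n = index
    *lead_bounds, (row_low, _), (col_low, _) = bounds
    offset = 0
    # Row-major accumulation over the leading dimensions (assumption).
    for sub, (low, high) in zip(lead_idx, lead_bounds):
        offset = offset * (high - low + 1) + (sub - low)
    return m - row_low, n - col_low, base + bitplanes * offset


if __name__ == "__main__":
    bounds = [(9, 12), (4, 5), (1, 128), (8, 135)]
    print(pe_location((10, 5, 1, 8), bounds))
```

For the declaration in the text, element [10,5,1,8] lands in row 0,
column 0, three elements (48 bit planes) above the base address.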
When the last two dimensions of an array do not match the
size of the PE array another mapping strategy must be used.
There are two cases to be considered - data array dimensions
smaller than the PE array size, and data array dimensions larger
than the PE array size.

5.2.2 Small Arrays 

If an array dimension is smaller than the corresponding
dimension of the PE array, then some PE's will not be used to
store the array. For instance, a 64x64 array will only use one
quarter of the PE's in the MPP. This manifests itself in two
ways. First, it is extremely wasteful of the main memory (a very
precious commodity). Second, since the edges of the array no
longer coincide with the edges of the PE array, rotating data
through the array will require more than one cycle per position
rotated.

One possible way to store a small array would be to use a
contiguous subsection of the hardware array; e.g. to store a
32x32 array on the MPP one could use all PE's
(i,j), 0<=i<32, 0<=j<32. This implementation works well if only
elemental operations are performed upon the data. In addition,
the ''shift'' function can be used without substantial overhead
(an extra cycle is required for each shift from the ''east'' or
''south'' in order to force the incoming values to zero).
However, the ''rotate'' function will be very expensive because
it will be necessary to propagate data shifted out one end across
96 inactive PE's to reach the other end of the array.

An alternate storage scheme would be to store in every
fourth PE in each direction; that is, the 32x32 array described
above could store array element (i,j) in PE (ix4, jx4)
(because 128/32 = 4). This scheme has the advantage that data
can be rotated across the logical array without having to
propagate edge information across a large number of inactive
PE's. However, each position shifted now requires 4 hardware
shift operations instead of one because the data items are
further apart. In addition, this scheme works best when the data
array has a dimension which evenly divides the length of the
hardware array; the implementation is not as apparent if the data
array is, say, 59x59. Finally, if arrays are stored in non-
contiguous PE's then it will be necessary to compress or expand
them when subsets interact with subsets of arrays of different
sizes. For these reasons, it seems advisable that small arrays
be stored in contiguous subsets of the hardware array.

Since small arrays do not occupy every memory module in the
PE array, several small arrays can share the same memory offset.
For instance, four 64x64 arrays can be stored at the same PE
memory address in the 128x128 MPP array. Arrays of different
sizes can also be stored together - the available PE's in a
memory plane can be ''parceled out'' as required.

If a small array has more than two dimensions, one possible
way of storing the array is to store several subarrays in the
same memory plane(s). For instance, a 4x64x64 array could be
stored at the same address in the 128x128 MPP memories. This
storage scheme has the advantage that all elements of the array
can be operated upon at once. The disadvantage of this scheme is
the time required to transfer data from element (i,j,k) to
element (i+1,j,k) - in the case described above 64 shifts are
required; whereas if the storage were ''vertical'' the data could
be handled internally by the PE's.

A special case of the general problem of small arrays is the
handling of vectors. Although a vector has only one dimension,
it is frequently convenient to manipulate a vector in the PE
array. In addition, a vector may result from a reduction
operation upon an array (e.g. using ''sum'' along the columns of
a matrix). A reasonable storage implementation would be to treat
a vector as a matrix with only one row or one column (whichever
is more convenient for the problem at hand). Like other small
arrays, several vectors could then be stored in the same memory
plane (e.g. in successive rows of the PE array).

Finally, if a matrix or vector is very small the
implementation may wish to ignore the parallel specification and
store the array in the scalar memory. Small vectors and matrices
can easily be accommodated there without incurring a significant
storage overhead. If the array is small, not much parallelism
will be sacrificed by performing operations with it serially.

The ability to specify the data storage format for small
arrays is lacking in Parallel Pascal. The decision about storage
formats is therefore left to the code generator. A future
language revision might include some provision for specifying the
data storage, along the lines of the parallel keyword which
Parallel Pascal does provide.

5.2.3 Large Arrays 

Large arrays present a different set of problems which must
be addressed. First, since the PE array is smaller than the data
array, no mapping of the last two array dimensions into the PE
array will be one-to-one. Instead, some PE's will contain more
than one point. Second, large arrays seriously impact the main
memory capacity - if the array is too large it may not fit in the
main memory at all. An example of this case is a 2048x2048 array
of integers, which requires eight megabytes of main memory (the
MPP as delivered will have only two megabytes of main memory).
Finally, if the data array size is not an even multiple of the PE 
array size, extra operations will be required when data rotations 
are performed. This last case closely resembles the rotation 
problem for small arrays which was discussed in the previous 
section; it will not be considered further here. 

The first problem is the manner in which large arrays are to
be stored. For convenience, the last two dimensions of the data
array will be assumed to be multiples of the PE array
dimensions. In the following discussion, two-dimensional arrays
are considered for convenience; additional dimensions are
implemented ''vertically'' within the PE array and are therefore
of no special interest here.

Let the PE array have dimensions (M, N) (both M and N are
128 on the MPP) and the data array have dimensions (AxM, BxN).
There are many possible ways in which to map the large data array
into the PE array, but simplicity dictates that the array be
stored in such a way that each PE is associated with AxB
elements.

If the data array is fairly small relative to the main
memory of the matrix processor (e.g. a 256x256 array of 8-bit
data on the MPP) one possible implementation is to store in each
PE memory an AxB subimage. That is, if ''arr'' were defined by:

    (* NROW = A*M, NCOL = B*N *)
    var
        arr: parallel array [0..NROW, 0..NCOL] of integer;

and it were stored starting at a_0, then (assuming an integer is
16 bits) arr[i,j] would be stored in PE

    (i div A, j div B)

at address

    a_0 + 16 x [(i mod A) x B + (j mod B)]

The advantage of this storage scheme is that adjacent points are
often in the same PE and no data transfers are needed. This can
be very useful when operations are performed involving near
neighbors. The disadvantage of this scheme is that when a
section of the array is operated upon (e.g. a 128x128 piece of a
256x256 array) the parallelism will be lower. Also, large arrays
cannot be conveniently manipulated because they will not fit into
main memory.

An alternate storage method is to divide the large array
into ''chunks'', each of which is the same size as the PE array.
Given the array ''arr'' defined above, this scheme would map
arr[i,j] (with base address a_0) into PE

    (i mod M, j mod N)

at address

    a_0 + 16 x [(i div M) x B + (j div N)]

The advantage of this scheme is its capability for performing
operations on subsections of the array with the maximum degree of
parallelism. This facilitates the processing of large arrays
(which are too large for the PE memories).
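
The two large-array layouts can be compared side by side with a
small Python sketch. It assumes 0-based indices, a PE array of size
(M, N), a data array of size (AxM, BxN), and row-major ordering of
the elements or chunks held at successive addresses; the function
names ''crinkled_location'' and ''chunked_location'' are
hypothetical, and the div-based terms are an assumption where the
formulas in this copy are incomplete.

```python
def crinkled_location(i, j, A, B, base=0, bits=16):
    """Local-subimage layout: each PE holds an A x B window of neighbours."""
    pe = (i // A, j // B)
    address = base + bits * ((i % A) * B + (j % B))
    return pe, address


def chunked_location(i, j, M, N, B, base=0, bits=16):
    """Chunk layout: the array is cut into PE-array-sized (M x N) pieces."""
    pe = (i % M, j % N)
    address = base + bits * ((i // M) * B + (j // N))
    return pe, address


if __name__ == "__main__":
    A = B = 2          # a 256x256 data array on the 128x128 MPP
    M = N = 128
    for i, j in [(0, 0), (0, 1), (128, 129)]:
        print((i, j), crinkled_location(i, j, A, B),
              chunked_location(i, j, M, N, B))
```

In the first layout, horizontally adjacent points (0,0) and (0,1)
share a PE; in the second they lie in neighbouring PEs at the same
address, so a 128x128 piece of the array uses every PE at once.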

Another important factor in the choice of a storage layout
for large data arrays is the ease with which the arrays can be
transferred into and out of the main memory. In general, the
second format (breaking a large array into pieces) is easier to
handle than the first format (storing a local subimage in each
memory); however, on the MPP both formats can be handled - the
staging memory is capable of ''crinkling'' an input image in
order to store a local window in each PE.

The initial implementation of Parallel Pascal on the MPP
will restrict the size of the last two dimensions of a parallel
array to be less than or equal to the PE array size. As a
result, the programmer will have to explicitly deal with large
data arrays in one of the ways described above (i.e. storing
multiple points in each PE or dividing the array into
''chunks''). The latter is performed by dimensioning a
128Nx128M array as

    type
        large = parallel array [1..N, 1..M, 1..128, 1..128] of integer;

and logically associating the first and third dimensions of this
array with the first dimension of the large array (similarly, the
second and fourth dimensions of the declared array are associated
with the second dimension of the large array). Simple operations
on the array can be expressed directly; statements such as

    a := b + c;

mean the same thing regardless of how the array is dimensioned.
However, data movements in the large array must be mapped into
movements in the small array. The following two routines,
''lshift'' and ''lrotate'', illustrate how a shift and rotate on a
(logical) large 128Nx128M array would be implemented on an
NxMx128x128 array. These functions also illustrate how an
automatic memory allocation scheme would handle large arrays
which are stored in ''chunks''.

(*
 * LSHIFT - Shift large array
 *
 * The array "a" (conceptually size N*MPPROW by M*MPPCOL,
 * dimensioned as a[1..N, 1..M, 1..MPPROW, 1..MPPCOL]) is end-off
 * shifted.
 *
 * The shifting is done in two stages - by rows and then by columns.
 *)

const
    MPPROW = 128;   (* number of MPP rows *)
    MPPCOL = 128;   (* number of MPP columns *)
    N = 10;         (* N*MPPROW = number of conceptual rows *)
    M = 20;         (* M*MPPCOL = number of conceptual columns *)

type
    MPPREAL = parallel array [1..MPPROW, 1..MPPCOL] of real;
    MPPBOOL = parallel array [1..MPPROW, 1..MPPCOL] of Boolean;
    LARRAY = array [1..N, 1..M] of MPPREAL;

function lshift(a: LARRAY; r,c: integer) : LARRAY;
var
    bsr : 0..N;         (* block shift amount (rows) *)
    isr : 0..MPPROW;    (* internal shift amount (rows) *)
    bsc : 0..M;         (* block shift amount (cols) *)
    isc : 0..MPPCOL;    (* internal shift amount (cols) *)
    tmp : LARRAY;       (* temporary array *)
    mask : MPPBOOL;     (* mask for internal rotates *)
begin
    bsr := r div MPPROW;
    isr := r mod MPPROW;
    bsc := c div MPPCOL;
    isc := c mod MPPCOL;

    tmp := rotate(a, 0, 0, isr, isc);
    mask := shift(a=a, 0, 0, isr, 0);
    where not mask do
        tmp := shift(a, 1, 0, 0, 0);
    mask := shift(a=a, 0, 0, 0, isc);
    where not mask do
        tmp := shift(a, 0, 1, 0, 0);
    lshift := shift(tmp, bsr, bsc, 0, 0);
end;

(*
 * LROTATE - Rotate large array
 *
 * The array "a" (conceptually size N*MPPROW by M*MPPCOL,
 * dimensioned as a[1..N, 1..M, 1..MPPROW, 1..MPPCOL]) is rotated
 * (circularly shifted).
 *
 * The rotation is done in two stages - by rows and then by columns.
 *)

const
    MPPROW = 128;   (* number of MPP rows *)
    MPPCOL = 128;   (* number of MPP columns *)
    N = 10;         (* N*MPPROW = number of conceptual rows *)
    M = 20;         (* M*MPPCOL = number of conceptual columns *)

type
    MPPREAL = parallel array [1..MPPROW, 1..MPPCOL] of real;
    MPPBOOL = parallel array [1..MPPROW, 1..MPPCOL] of Boolean;
    LARRAY = array [1..N, 1..M] of MPPREAL;

function lrotate(a: LARRAY; r,c: integer) : LARRAY;
var
    bsr : 0..N;         (* block rotation amount (rows) *)
    isr : 0..MPPROW;    (* internal rotation amount (rows) *)
    bsc : 0..M;         (* block rotation amount (cols) *)
    isc : 0..MPPCOL;    (* internal rotation amount (cols) *)
    tmp : LARRAY;       (* temporary array *)
    mask : MPPBOOL;     (* mask for internal rotates *)
begin
    bsr := r div MPPROW;
    isr := r mod MPPROW;
    bsc := c div MPPCOL;
    isc := c mod MPPCOL;

    tmp := rotate(a, 0, 0, isr, isc);
    mask := shift(a=a, 0, 0, isr, 0);
    where not mask do
        tmp := rotate(a, 1, 0, 0, 0);
    mask := shift(a=a, 0, 0, 0, isc);
    where not mask do
        tmp := rotate(a, 0, 1, 0, 0);
    lrotate := rotate(tmp, bsr, bsc, 0, 0);
end;
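
The decomposition used by both routines, a block movement of
r div MPPROW chunks plus an internal movement of r mod MPPROW
positions, can be checked on a one-dimensional analogue. The Python
sketch below is not a translation of the listings; it is a
simplified illustration with hypothetical names, assuming a
non-negative end-off shift toward higher indices with a fill value.

```python
def chunked_shift(chunks, r, fill=0):
    """End-off shift a logical 1-D array stored as equal-sized chunks.

    chunks: list of lists, each of the same length (the PE-array extent)
    r:      non-negative shift distance toward higher indices
    fill:   value shifted in at the low end
    """
    n_chunks = len(chunks)
    size = len(chunks[0])
    block, inner = divmod(r, size)   # block shift and internal shift
    result = []
    for k in range(n_chunks):
        new_chunk = []
        for e in range(size):
            if e >= inner:
                src_chunk, src_elem = k - block, e - inner
            else:
                # The internal shift pulled this element across a
                # chunk boundary, so it comes from one chunk earlier.
                src_chunk, src_elem = k - block - 1, e - inner + size
            if 0 <= src_chunk < n_chunks:
                new_chunk.append(chunks[src_chunk][src_elem])
            else:
                new_chunk.append(fill)
        result.append(new_chunk)
    return result


if __name__ == "__main__":
    data = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
    print(chunked_shift(data, 5))
```

The elements that cross a chunk boundary during the internal shift
are the ones the ''mask'' in the listings is meant to single out.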
5.3 Data Migration

Because the main memory on the MPP is so small, it is
inevitable that many programs will require more memory than can
be provided in the processor array alone. Thus, some form of
data migration will be required. This can be implemented in one
of two ways. First, the programmer could be required to handle
all data migration. Second, an automatic memory management
system could be used and the programmer could be unaware of the
transfer of data between memories.

In the following sections, the behavior of the MPP is
considered in an attempt to determine an appropriate memory
management strategy.

5.3.1 The Overlap Factor 

The movement of data between memories will slow down the
computation of the MPP system on a problem by some amount. On
the MPP, input-output and computations may be overlapped. A
quantity which is of some interest is the execution time penalty
for not overlapping input-output with computation. Let T_c be the
total computation time, T_io be the total input-output time, and
T_o be the time required if input-output is overlapped with
computation as much as possible. Then f, the fraction of the
possible speed obtained with non-overlapped input and output
versus a fully-overlapped implementation, may be defined as

    f = T_o / (T_c + T_io)

T_o is bounded by the inequalities:

    T_o >= T_io,   T_o >= T_c,   T_o <= T_io + T_c

The last of these implies that f <= 1. The first two inequalities
bound f from below:

    1) If T_c > T_io then T_o >= T_c
       and f >= T_c / (T_c + T_io) > 1/2

    2) If T_io > T_c then T_o >= T_io
       and f >= T_io / (T_c + T_io) > 1/2

    3) If T_c = T_io then T_o >= T_c = T_io
       and f >= T_c / (T_c + T_io) = 1/2

In summary, 1/2 <= f <= 1. Thus, the maximum penalty for not
overlapping input-output and computation is a speed reduction of
1/2.
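
The bounds on f can be checked numerically. The Python sketch below
assumes that f is the ratio of the overlapped time to the sum of the
computation and input-output times, which is the reading consistent
with the bounds stated in the text; the function name and the
default of perfect overlap are assumptions made for the example.

```python
def overlap_fraction(t_c, t_io, t_o=None):
    """Speed of a non-overlapped run relative to an overlapped one.

    t_c:  total computation time
    t_io: total input-output time
    t_o:  time with overlap; defaults to max(t_c, t_io), i.e. the
          best case in which the shorter activity is hidden entirely
    """
    if t_o is None:
        t_o = max(t_c, t_io)
    if not (max(t_c, t_io) <= t_o <= t_c + t_io):
        raise ValueError("t_o must lie between max(t_c, t_io) and t_c + t_io")
    return t_o / (t_c + t_io)


if __name__ == "__main__":
    for t_c, t_io in [(3.0, 1.0), (1.0, 3.0), (2.0, 2.0)]:
        print(t_c, t_io, overlap_fraction(t_c, t_io))
```

The worst case, f = 1/2, occurs only when the two times are equal
and the overlap is perfect.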


5.3.2 I/O-CPU Time Ratios 

Another factor of interest is the relationship between the
time required for the input-output associated with a problem and
the CPU time required to solve that problem.

Let S be the execution speed (in operations per second).
Define T_op as the time required to perform one array operation.
Since the MPP has an array of 128x128 PE's, T_op can be calculated
by

    T_op = (128x128) / S = 16384 / S

The MPP has a two-level secondary memory. Data is
transferred from the PE memories to the staging buffer by
shifting it across the rows of the array. It requires 128 cycles
to shift a single bitplane into the array. Let T_cpu be the CPU
cycle time and B be the number of bitplanes to be shifted.
Define T_s as the time required to transfer a data item from the
staging buffer to the PE memories. Then

    T_s = 128 B T_cpu

One possible configuration of the MPP is to use a very
high-speed parallel disc system for secondary storage. Let L_d be
the average seek and rotational latency of the disc, and R be the
transfer rate (in bytes per second). Define T_d as the time
required to transfer data from the disc to the staging area.
Then

    T_d = L_d + (128x128xB / 8)(1 / R) = L_d + 2048 B / R

(The number of bytes is equal to the number of bitplanes divided
by 8.)

Define X as the ratio of the time spent on input and output
to the time spent performing the operation. Then

    X = (T_s + T_d) / T_op

      = (128 B T_cpu + L_d + 2048 B / R) / (16384 / S)

      = S B T_cpu / 128 + S L_d / 16384 + S B / (8 R)

The first term represents the contribution of the staging area;
the second term is due to the disc access latency, and the third
term is due to the disc transfer rate.

The value of X represents the average number of operations
per array element required to keep the PE array busy. If,
instead, only aX operations are performed (where 0<a<1) then the
average utilization of the PE array will be a. Thus, when X is
large the input-output will dominate unless the task is very
processor-intensive.
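
The ratio X can be evaluated directly from its definition. The
Python sketch below is a worked example with hypothetical names; it
assumes the 128x128 array, the 100 nanosecond cycle time, and a
transfer rate expressed in bytes per second, as in the text.

```python
def io_cpu_ratio(speed, bitplanes, latency, rate,
                 cycle=1e-7, pes=128 * 128, row_length=128):
    """Ratio X of input-output time to operation time.

    speed:     S, operations per second
    bitplanes: B, bit planes moved per operation (operands + result)
    latency:   L_d, average disc access latency in seconds
    rate:      R, disc transfer rate in bytes per second
    cycle:     T_cpu, cycle time in seconds
    """
    t_op = pes / speed                          # one array operation
    t_s = row_length * bitplanes * cycle        # staging buffer <-> PEs
    t_d = latency + (pes * bitplanes / 8) / rate  # disc <-> staging buffer
    return (t_s + t_d) / t_op


if __name__ == "__main__":
    # 8-bit integer addition, 15 ms latency, 10^8 bytes per second
    print(io_cpu_ratio(6553.6e6, 25, 0.015, 1e8))
```

For 8-bit integer addition with a 15 millisecond latency and a
transfer rate of 10^8 bytes per second this gives approximately
6332.8.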

X can now be computed for three typical MPP operations: 8-bit
integer addition (computing a 9-bit result), 32-bit
floating-point addition, and 32-bit floating-point
multiplication.


5.3.2.1 I/O-CPU time ratio: integer addition

The reported speed of the MPP performing 8-bit integer
addition is 6553.6 million operations per second. There are two
8-bit operands and one 9-bit result. The cycle time of the MPP
is 100 nanoseconds. Hence,

    S = 6553.6 x 10^6
    B = 8+8+9 = 25
    T_cpu = 10^-7

    X = (6553.6 x 10^6)(25)(10^-7) / 128 + (6553.6 x 10^6) L_d / 16384
          + (6553.6 x 10^6)(25) / (8 R)

      = 128 + (4 x 10^5) L_d + (2.048 x 10^10) / R

If the data is transferred between the PE memories and the
staging buffer, the input-output will take 128 times longer than
the computation. If a secondary memory is involved, its latency
must be very low and its transfer rate very high. Figure 1 shows
the dependence of X upon the disc transfer rate for two values of
L_d.

For this relatively simple operation, the processing time is
swamped by the time required to perform input-output. For
instance, for a disc system with L_d = 15 ms (fast by current
standards) and R = 10^8 bytes per second (very fast relative to
current technology), X is still high:

    X = 128 + (4 x 10^5)(0.015) + (2.048 x 10^10) / 10^8

      = 6332.8

This means that in order to achieve 50% processor utilization
when performing integer addition with two input arrays and one
output array, it is necessary to perform (6332.8)(0.50) = 3166.4
array operations.


5.3.2.2 I/O-CPU time ratio: floating addition

The reported speed of the MPP performing 32-bit floating-point
addition is 470 million operations per second. There are
two 32-bit operands and one 32-bit result. The cycle time of the
MPP is 100 nanoseconds. Hence,

    S = 470 x 10^6
    B = 32+32+32 = 96
    T_cpu = 10^-7

Using these values, X can be computed as

    X = (470 x 10^6)(96)(10^-7) / 128 + (470 x 10^6) L_d / 16384
          + (470 x 10^6)(96) / (8 R)

      = 35.25 + (2.87 x 10^4) L_d + (5.64 x 10^9) / R

If the data is transferred between the PE memories and the
staging buffer, the input-output will take approximately 35 times
longer than the computation. If a secondary memory is involved,
its latency must be very low and its transfer rate very high.
Figure 2 shows the dependence of X upon the disc transfer rate
for two values of L_d.

While the input-output time is still much greater than the
computation time, the difference is an order of magnitude less
than in the 8-bit integer addition case above; for example, given
L_d = 15 ms and R = 10^8 bytes per second,

    X = 35.25 + (2.87 x 10^4)(1.5 x 10^-2) + (5.64 x 10^9) / 10^8

      = 522.15


5.3.2.3 I/O-CPU time ratio: floating multiplication

The reported speed of the MPP performing 32-bit floating-point
multiplication is 291 million operations per second. There
are two 32-bit operands and one 32-bit result. The cycle time of
the MPP is 100 nanoseconds. Hence,

    S = 291 x 10^6
    B = 32+32+32 = 96
    T_cpu = 10^-7

Using these values, X can be computed as

    X = (2.91 x 10^8)(96)(10^-7) / 128 + (2.91 x 10^8) L_d / 16384
          + (2.91 x 10^8)(96) / (8 R)

      = 21.83 + (1.776 x 10^4) L_d + (3.492 x 10^9) / R

If the data is transferred between the PE memories and the
staging buffer, the input-output will take approximately 22 times
longer than the computation. If a secondary memory is involved,
its latency must be very low and its transfer rate very high.
Figure 3 shows the dependence of X upon the disc transfer rate
for two values of L_d.

The gap between the input-output time and the computation
time is narrower than in either of the two cases above; for
L_d = 15 ms and R = 10^8 bytes per second the ratio is

    X = 21.83 + (1.776 x 10^4)(1.5 x 10^-2) + (3.492 x 10^9) / 10^8

      = 323.15

5.3.2.4 I/O-CPU time ratio: sequence of operations

In the previous three cases, X has been computed assuming
that each data item (128x128 array) is transferred individually
and used only once. If instead the data is ''spooled'' -
transferred in blocks - the access latency will be ''spread out''
across many more elements. For instance, if a block of N 128x128
arrays of floating-point numbers is transferred in one operation,
X can be computed as

    X = (2.91 x 10^8)(N x 32)(10^-7) / 128 + (2.91 x 10^8) L_d / 16384
          + (2.91 x 10^8)(N x 32) / (8 R)

Note that now X is the number of operations that must be
performed on the set of N arrays; the number of operations per
transfer is X/N. Figure 4 shows this value for transfers with
different block sizes (i.e. values of N).


Figure 4: Dependence of I/O-CPU time ratio on block size

Problems which are best suited to a parallel matrix
processor are usually very computationally intensive. It is an
extreme case to input two operands, perform an operation, and
output the results. A far more common occurrence is to input the
operands, perform many operations upon them, and then output the
results. As the previous sections have described, X indicates
the ratio of the input-output time to the computation time. An
equivalent way of expressing this is that if N operations are
performed per transfer, N<X, and input-output is completely
overlapped with processing, the PE array utilization will be N/X.
(If N>X then the PE array utilization will be 1.) Figures 5 and 6
illustrate the effect of the disc transfer rate (with 15
millisecond access latency) and the number of operations
performed per transfer upon the PE array utilization, for a
sequence of floating-point multiplications. Figure 5 assumes
that each 32-bit floating-point operand is transferred
individually; figure 6 assumes that the data is spooled with a
block size of 512 bit planes (sixteen 32-bit operands are
transferred to/from the disc at once). If the disc is
sufficiently fast, the operands are spooled (transferred in
blocks), and each value is used in many operations, a totally-
overlapped input-output system can keep the PE array busy 100% of
the time.
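
The effect of spooling and of performing many operations per
transfer can be sketched as follows. This is an illustration under
the assumptions of the preceding sections (128x128 array, 100
nanosecond cycle, transfer rate in bytes per second); the function
names are hypothetical.

```python
def spooled_ratio(speed, bits_per_item, block, latency, rate, cycle=1e-7):
    """X for a block of `block` arrays moved in a single disc access."""
    planes = block * bits_per_item
    return (speed * planes * cycle / 128      # staging buffer term
            + speed * latency / 16384         # latency, paid once per block
            + speed * planes / (8 * rate))    # disc transfer term


def pe_utilization(ops_per_transfer, x):
    """PE array utilization with fully overlapped input-output."""
    return min(1.0, ops_per_transfer / x)


if __name__ == "__main__":
    speed, latency, rate = 2.91e8, 0.015, 1e8   # floating multiplication
    for block in (1, 16):
        x = spooled_ratio(speed, 32, block, latency, rate)
        print(block, x / block, pe_utilization(100 * block, x))
```

Because the latency term is paid once per block, the number of
operations required per array falls as the block size grows.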

5.3.2.5 I/O-CPU Time Ratios: Conclusions

It is not particularly surprising that the MPP can process
data much faster than it can perform input and output. However,
the discrepancy between the input-output speed and the processing
speed can be very large. This suggests several things.

Figure 5: Dependence of PE utilization on transfer rate: unspooled

First, it is absolutely essential that data transfers be
kept to a minimum. Data should be read in and written out only
once if at all possible. It is far more important to minimize
the number of transfers than to achieve maximum overlap between
input-output and processing, for a system which performs the
minimal amount of input-output but does not overlap that input-
output with computations is at worst one-half as fast as the
optimum system.

Second, a high-speed secondary memory is absolutely
essential. The very optimistic figures for L_d and R still
produced a high ratio of input-output time to CPU time. As
delivered, the MPP will instead have a six megabaud link to the
host processor. Unfortunately, although this link itself is slow
relative to the MPP processor speeds, the limiting factor in this
system will be the memory systems which are attached to the host.
Some standard disc subsystems for the VAX are listed in Table 1
along with their average access times and transfer rates [2].
Considering the analysis of the preceding section, discs with
such high access times and low transfer rates (relative to MPP
processing times) will severely limit the performance of the
system.
Table 1: Access Times and Transfer Rates for DEC Discs

    disc      access time       transfer rate
              (milliseconds)    (kilobytes/second)

    RWI       T2T3              8TT5
    RM03      38.3              1200
    RM80      33.3              1200
    RP06      38.3               806
    RM03      38.3              1200
    RP07      31.3              1300
    RP07-D    31.3              2200


Third, transfers should be performed between the staging
buffer and the PE memories whenever possible. The gap between
the input-output time and the processing time is relatively
narrow when only the staging buffer is involved. When sequences
of complicated operations (e.g. intensive floating-point
calculations) are performed it will be possible to significantly
overlap transfers between the PE array and the staging buffer.

Finally, when it is necessary to transfer data to the
secondary memory it should be transferred in relatively large
blocks. The dominating factor in the input-output time to the
disc is the access latency; hence, it is desirable to transfer as
much as possible when input-output must be performed.

5.3.3 Implementation Alternatives 

There are two possible implementation schemes for data 
migration on the MPP. The transfer of data between memories may 
be handled by a memory-management system (and hence be 
transparent to the programmer) or it may be directly programmer- 

5.3.3.1 Automatic Data Migration

Chapter 3 described the implementation of Parallel Pascal
through the intermediate language Parallel P-code. One of the
significant characteristics of Parallel P-code is its stack
orientation. The amount of memory (excluding dynamically
allocated memory, which is runtime-dependent) which the main
program and each function or procedure require (per call) can be
determined by the code generator at compile-time. Given an
unbounded memory size, the memory address of the next temporary
location can be easily determined (it is the address following
the top-of-stack).

Unfortunately, the available memory on the MPP is not
unbounded; on the contrary, it is very small. If an automatic
memory management scheme is to be used, some locations in the
main memory must be shared by several different data items.
Hence, the main memory, staging area, and secondary memory form a
three-level memory hierarchy.

Conventional machines often utilize memory hierarchies at
two levels. The first is the addition of a hardware cache memory
to supplement the bulk main memory of the machine. This is
usually implemented solely in the hardware of the machine. The
second level is the implementation of ''virtual memory'', which
migrates data from main memory to secondary (disc) memory. This
allows a program to use a large (virtual) address space without
requiring that all of that address space be physically resident
in main memory at all times. Virtual memory is typically
performed by software, with appropriate hardware support.

The MPP memory hierarchy does not greatly resemble a cache
memory system. Cache memories are usually an order of magnitude
faster than the main memory of the computer and a few orders of
magnitude smaller. For example, in the PDP-11/70 the cache
memory has an access time of 0.3us and a capacity of 2 kilobytes,
while the (magnetic core) main memory has an access time of
1.32us and a typical capacity of 512-1024 kilobytes. In the MPP,
on the other hand, the time to access data from the staging
buffer is approximately two orders of magnitude greater (it
requires 129^ cycles rather than cycles to fetch ^ bits), while 
in the delivered version the main store is equal in size to the
staging buffer. (Even with a full complement of memory, the
staging buffer will only be 32 times larger than the main

when it is accessible the program is restarted. While the
transfer is taking place the program is suspended and another
program is allowed to run. In the ideal case, the processor is
always busy even though one or more programs are currently
blocked. The MPP, however, is not multiprogrammed. When the
program is blocked awaiting input (or output) the array of
processing elements is idled. As section 5.3.2 noted, input-output
(especially to a secondary memory) is very expensive.

Without actual runtime experience it is difficult to predict
the type of automatic memory management system which would be
most effective. Without such experience it seems advisable to
consider some qualities that such an implementation might
possess, instead of attempting to fully define the
implementation.

First, because transfers between the staging area and main
memory are the least expensive (for sequences of complex
operations such as floating-point arithmetic it will be possible
to overlap most of the input-output with other computations) the
staging buffer should be used to hold variables and temporary
data which will be needed again, and the secondary memory should
be used only for input of the original problem and output of the
results (or, if necessary, overflow from the staging area).

Second, all transfers between the staging area and the
secondary memory should consist of blocks of data. If necessary,
data may be loaded into the staging area before it is needed in
order to avoid later disc references (with their long access
latencies). This is analogous to demand prepaging in virtual
memory systems [3]. Some form of data restructuring may be used
to improve the clustering of commonly-used locations into
contiguous locations [4].

Finally, the stack orientation of Parallel P-code can be
used to advantage. When a function or procedure is called
recursively the data locations corresponding to the previous
activation of that routine become inaccessible (until the
recursively-called routine returns). Temporary locations at any
outer lexical level are also inaccessible until control returns
to the routine which calculated them. If data migration is
necessary, these locations could be used, especially those at the
outermost lexical levels (for which the next reference will be a
relatively long time in the future). If memory planes are shared
by several small arrays (those with dimensions less than the
hardware array size), this method can still be used provided that
memory planes are shared only by variables (or temporary data)
within the same procedure activation.

5.3.3.2 Programmer-Directed Data Migration

The alternative to an automatic memory management system is
a programmer-directed system. Such a system requires the
programmer to be concerned with the implementation details; it is
therefore less portable and somewhat more difficult to use.
However, it has the potential for higher system performance since
no potentially inefficient data transfers are performed ''behind
the programmer's back.'' The initial implementation of Parallel
Pascal on the MPP will use this scheme, as described in section
2.3 of this report.


6: FAULT TOLERANCE IN HIGHLY PARALLEL
MESH CONNECTED PROCESSORS


6.1 Introduction

The mesh interconnection scheme has been used on several large scale SIMD
parallel processors. This scheme involves organizing the processing elements
(PE's) into a two dimensional matrix such that each PE has data interconnections
with its adjacent neighbors. In a typical organization a PE has connections to 4
near neighbors in the cardinal directions N, S, E and W. In a single instruction
data may be shifted in a single specified direction between all adjacent PE's.
That is, a distributed matrix of data may be shifted one mesh position in parallel.
The main advantage of the mesh scheme is its simplicity and suitability for a
large class of scientific applications. Data interconnections only occur between
adjacent PE's; this means that they may be kept very short and laid out on a
single interconnection plane. The usual disadvantage with this scheme is that data
transfers to distant PE's require a large amount of time since the data can only
cross between adjacent mesh nodes with each clock cycle. However, there is a
large class of problems, including physical system modeling using partial
differential equations and image processing, in which the data needed by a PE is
located in its local mesh area and the mesh interconnection scheme is very
efficient.

One potential problem with the mesh scheme is that the failure of any node
in the mesh renders the whole parallel processor inoperable. Current LSI
processor designs involve a mesh with more than 10,000 nodes; with VLSI technology
systems having 1,000,000 nodes and more may be anticipated. For some
applications, for example real-time image processing in a remote inaccessible robot
system, some fault tolerance is essential.

6.2 Mesh Connected Parallel Processors

An important large scale mesh parallel processor is the Illiac IV [1] developed
in the late 1960's. It consists of an 8 x 8 mesh connected set of 64 PE's, each PE
having an ALU with a 64-bit-wide datapath and floating point capabilities. This
architecture is well suited to applications such as partial differential equations.
The implementation of Illiac IV was hampered by the technology of the time.
Hardware failures were anticipated to occur every few hours. The PE's were
regularly subjected to an extensive library of automatic tests and were replaced
manually if any faults were detected.

A more recent design, based on LSI technology, is the Massively Parallel
Processor (MPP) [2,3] which is currently being developed for NASA by Goodyear
Aerospace and should be constructed by late 1981. The MPP consists of a 128 x
128 mesh of 16,384 PE's, each of which has a 1-bit-wide data path and can
achieve floating point operations through bit-serial algorithms. This architecture
is designed for image processing applications where a single image may be as large
as a 6000 x 6000 matrix. The MPP processes such an image as a sequence of 128
x 128 subimages. The MPP involves parity checks on each 8 bits of the PE local
memories and has a redundant column of PE's which may be switched in by the
host computer to replace a faulty column. With this fault tolerance the MPP is
expected to run for several hundred hours before requiring manual intervention.
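
To give a feel for the number of subimages involved, the count can be worked out directly. The sketch below assumes that partial subimages at the image edges are padded to the full array size; the function name is illustrative only.

```python
import math


def subimage_count(rows, cols, array_size=128):
    """Number of array_size x array_size subimages needed to cover an image.

    Assumption: edge subimages that are only partly filled still occupy
    one full pass through the PE array.
    """
    return math.ceil(rows / array_size) * math.ceil(cols / array_size)


if __name__ == "__main__":
    print(subimage_count(6000, 6000))
```

For a 6000 x 6000 image this gives 47 subimages along each dimension, or 2209 subimages in total.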

The fault tolerance concepts in this paper will be considered with respect to a
bit-serial PE array scheme or Binary Array Processor (BAP) [4] such as the MPP.
These concepts may be extended to word parallel designs such as the Illiac IV
type architecture. However, with the constraints that the matrix to be processed
is larger than the PE array and that the algorithms to be implemented are well
formed for the mesh organization, the bit-serial approach has significant
advantages over the word parallel approach for equivalent amounts of hardware [5].

A general block diagram for a large scale BAP is shown in Fig. 1. Data pro- 
cessing is achieved with the array of PE's. Data is input to and output from the 
PE array via the I/O buffer memory which communicates the data to peripherals 
and bulk auxiliary storage devices. Instructions to the PE array are issued by a 
single high-speed microprogrammed control unit. The whole system synchroniza- 
tion is maintained by a conventional host computer which issues macro instruc- 
tions to the control unit. Some feature information may be extracted from the 
PE array by the global information extraction mechanism. 

A typical organization for an MPP-like PE is shown in Fig. 2. Data from 
adjacent near neighbors is selected by the NN multiplexor. The control lines and 
local memory address lines are broadcast to all PE's in the array. The OR bus is
a line from all PE's to the control unit which has a one value if any PE outputs a 
one. The I register is used for data I/O; it receives data from the I register of the 
adjacent PE to the left and transmits data to the PE on the right. 

Fig. 2. A typical BAP PE organization.

The I/O buffer memory is a vital part of the BAP system; it is responsible for
making reformatted data available to the PE array. With the MPP a data matrix
is input to the array as a set of bit-planes. Each bit plane is input along one edge
of the PE array, one column with each clock cycle. Each row of the array acts as
a shift register. When the complete bit plane has been input it is stored in the PE
local memories in one clock cycle. Fault tolerance in the I/O buffer can be
achieved with the single error correction-double error detection (SECDED)
schemes common in many recent large memory systems. For the PE array the
I/O mechanism is like a one dimensional mesh connection, and it will not be
treated as a separate issue for fault tolerance. For a BAP system the I/O could be
achieved by the mesh interconnection hardware; alternatively, separate but
basically similar I/O hardware, as shown in Fig. 2, could be used, if necessary,
to avoid blocking.

The global feature extraction mechanism on the MPP is an OR function over
all PE elements, which outputs a 1 if any PE has a 1. If we have an error
detection mechanism then a similar global OR function would be needed to report an
error to the host processor. Once again these two functions will be considered to
be implemented with the same hardware in this paper; such a scheme is used with
the MPP [3]. It has been suggested that a more powerful feature extraction
mechanism, such as counting the number of bits set in a bit plane, may be cost
effective for future BAP systems [0]. 

The MPP PE array is constructed with two LSI chips: a PE chip and a
memory chip. The PE chip contains 8 PE's (without local memories) in a 2 x 4
array. Each PE chip has connections to 8 1-bit memories for the PE's and an
additional 1-bit memory for a parity check of the other 8. The total PE array
consists of 33 4-PE wide columns, each column consisting of 128 PE chips. A PE
chip has a control input which, when activated, disables the chip by connecting
corresponding East West pin data lines together. In this way any one of the 33
columns may be disabled to achieve an operational 128 x 128 PE array. When
a fault is detected the faulty column is disabled and the redundant column is
used to replace it.

Faults will be considered here to be of two basic types - local and module. A 
local fault may typically be a broken data line or a faulty memory bit whereas a 
module fault implies the complete failure of a module, such as a chip, which may 
result in a set of related PE’s being made inoperable. 

Since we are dealing with functionally very complex chips the probability of 
a local fault may be expected to be significantly higher than a module fault. 
Therefore the main effort of the work here is concerned with local faults as they
are much simpler and cheaper to deal with. However, any practical very large
scale mesh connected processor also needs some fault tolerance at the module 
level. 

For the MPP, the redundant column scheme is effective for any single
memory chip failure. It is also effective for most PE local failures, e.g. if a data
line breaks in a PE. Therefore the most probable fault causes have been covered.
However, if a catastrophic (module) failure occurs to a PE chip then the whole
PE array may become inoperable since it is necessary for data to flow through a
disabled chip.

6.3 A VLSI PE Array Organization

For a VLSI system design there are two fundamental chip size limitations: (1)
the number of devices which can be put onto a chip and (2) the number of pin
connections which may be made to the chip. A usual characteristic of a VLSI
design is modularity, i.e. a chip consists of a very large number of identical
modules, which is important to minimize the development cost. Finally, with
very large functionally complex chips fault tolerance may be effective to
significantly increase the production yield and the fault free lifetime of the chip.

A possible chip organization for a very large VLSI PE array is shown in Fig.
3. Three different VLSI chip types are involved: a PE ALU chip, a local memory
chip and a PE mesh interconnection chip (MIC).

The PE ALU chip consists of a set of PE ALU's, each having a limited
amount of local memory. These ALU's share common ALU-function and address
lines but do not have any data interconnections between each other. The data
access to a PE is achieved by a single pin on the chip which is connected to a
bidirectional bus line. The design of an effective PE with this input/output
constraint is described in [5]. With this design optimal bit-serial processing times for
addition, multiplication and logical operations can be achieved. The limited size
on-chip local memory may be used for table-look-up applications since it may be
addressed by an ALU register (unlike the external local memory) or for a cache
memory.

The PE ALU chip will be a functionally very dense chip and will contain as
much logic as the VLSI technology will allow. There are no pin connection
problems since only one pin is required for each additional PE.

The external local memory chip will provide the main local data storage for a
PE. With the amount of single chip storage which is becoming available with
emerging memory technology, it is possible that one VLSI memory chip could
contain adequate storage for one or even several PE's. The 1-bit wide external PE
memory is connected to the single-bit PE data bus. Once again the limitation
with the memory chip is caused by the functional complexity achievable with the
VLSI technology; there is no pin connection problem.

The interconnection chip realizes the mesh interconnections between the PE's
and also contains an input/output mechanism for data I/O to the PE array.
Unlike the previous chips this chip is functionally very simple and the size of the
mesh which can be contained on a chip is limited by the maximum possible
number of pin connections. Each mesh node requires one pin connection to its PE
data bus and also, for an m x n mesh, 2(m+n) additional data interconnections are
needed to adjacent MIC's.

All the additional logic to achieve error reconfigurability for the PE array is
located in the MIC's.

6.4 PE Fault Tolerance

Both the ALU and external memory chips are functionally very complex and
therefore more likely to fail than an MIC. In this section we consider how to
reconfigure the array if a single PE-external memory combination fails. This
reconfigurability is achieved by modifying the MIC so that it has access to spare
PE's which may be switched in to replace the faulty PE.

A basic non-fault-tolerant MIC organization for a 2x2 mesh subsection of a
PE array is shown in Fig. 4. This chip has a total of 12 data pin connections: 1 to
each of the 4 PE's and 8 to adjacent neighbor MIC's. In general an m x n mesh
MIC would require mn+2m+2n data pin connections.

The basic logic device on which the design of the MIC will be based is the
selector which is illustrated in Fig. 5(a). A selector has a set of control inputs, C,
which specify by a binary code which of the X data items is to be connected to the
Y data line. Once connected, data may flow in either direction from X to Y or Y
to X. With some logic technologies an additional control input may be needed to
specify the data flow direction. However, with designs considered here, the
direction information is always locally available; therefore this additional control
line is no problem.

Fig. 4. A simple 2x2 mesh interconnect chip organization.

Each mesh node of the simple MIC shown in Fig. 4 may be implemented with
a 5-way selector and a 1-bit register as shown in Fig. 5(b). Two clock cycles are
required, with this design, to transfer data between adjacent PE's. In the first
clock cycle the data is output from the PE and loaded into its mesh node P
register. Then, in the second clock cycle, the PE reads the value of an adjacent PE's P
register. Data may be transferred between more distant PE's by shifting it
through a connected sequence of P registers. In general, a data transfer through a
path of K stages requires K+1 clock cycles.
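
The K+1 cycle count can be checked with a small simulation of one row of P registers. This is a sketch only: it models a single row, shifts in the E direction, and takes a path of K stages to mean a destination PE that is K mesh positions from the source. The function names are illustrative.

```python
def shift_east(p_regs):
    """One clock cycle: every P register loads the value of its west neighbour."""
    return [0] + p_regs[:-1]


def send_east(value, width, k):
    """Move `value` from the PE at column 0 to the PE at column k.

    Returns (value received, clock cycles used).
    """
    if not 1 <= k < width:
        raise ValueError("destination must lie inside the row")
    p_regs = [0] * width
    p_regs[0] = value              # first cycle: source PE loads its own P register
    cycles = 1
    for _ in range(k - 1):         # intermediate cycles: P register to P register
        p_regs = shift_east(p_regs)
        cycles += 1
    received = p_regs[k - 1]       # last cycle: destination PE reads its west neighbour's P
    cycles += 1
    return received, cycles


if __name__ == "__main__":
    print(send_east(1, 8, 1))
    print(send_east(1, 8, 3))
```

For adjacent PE's (K = 1) the sketch uses two cycles, in agreement with the text, and one further cycle is added for each additional stage.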

To achieve fault tolerance to a single PE failure we first consider adding a
spare PE to each MIC group. A possible organization for such a reconfigurable
MIC is shown in Fig. 6. Each mesh node may be connected to one of two PE's. If
any PE fails each mesh node can be connected to a unique, operational PE.

The details of the modified mesh node design are shown in Fig. 7. A Q
register and a 2-way selector have been added to each node. The value of the Q
register specifies which of the two possible PE's the mesh node is connected to. When
a faulty PE is identified the host computer generates a bit mask which is
distributed to the P registers; then the Q registers are loaded from the P registers to
isolate the faulty PE. The task which was in progress when the faulty PE was
detected must be reloaded or restarted.
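
The bit mask generated by the host can be sketched as follows. The wiring is an assumption made for illustration, since the figure is not reproduced here: mesh node i is taken to be connected to PE i (Q = 0) or PE i+1 (Q = 1), so that N nodes share N+1 PE's in a chain. The function names are hypothetical.

```python
def q_mask(num_nodes, faulty_pe):
    """Q bit for each mesh node under the assumed chain wiring.

    Nodes before the faulty PE keep their own PE (Q = 0); nodes at or after
    it select the next PE along the chain (Q = 1).
    """
    return [0 if node < faulty_pe else 1 for node in range(num_nodes)]


def pe_assignment(num_nodes, faulty_pe):
    """PE index used by each mesh node after reconfiguration."""
    return [node + q for node, q in enumerate(q_mask(num_nodes, faulty_pe))]


if __name__ == "__main__":
    # 9 mesh nodes sharing 10 PE's, with PE 4 marked faulty.
    print(q_mask(9, 4))
    print(pe_assignment(9, 4))
```

With this wiring every mesh node receives a unique operational PE whichever single PE fails, including the case where the spare itself is the faulty one (all Q bits remain 0).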

The above MIC modifications require only two new pin connections to the
MIC. One is the load control and the other is the data connection to the extra
PE. One extra PE must be available to each MIC; however, it is possible for
MIC's to share a PE as indicated by the broken lines in Fig. 6. In this case only
one extra PE for a group of MIC's is needed.

The above technique is easily extended if protection against more than one
faulty PE for each MIC (or group of MIC's) is required. For example, protection
against any two faulty PE's could be achieved by connecting two extra PE's to
the MIC as shown in Fig. 8. The PE selector at each mesh node must select
between three PE's, and the Q register must be extended to contain two bits of
information. In the general case, fault reconfiguration for up to K faulty PE's
requires K extra PE's; each MIC requires K+1 more pin connections than for no
protection. Each mesh node in the MIC must contain a (K+1)-way PE selector
and a Q register large enough to address it.

Fig. 6. A 3 x 3 matrix of interconnection nodes connected to 10 PE's.

Fig. 7. A mesh node with PE fault tolerance.

Fig. 8. Organization for fault tolerance to any two faulty PE's.

6.5 MIC Mesh Node Fault Tolerance

Once the system may be reconfigured for any faulty PE, the next problem
area is the very large mesh interconnection network itself. Fault tolerance is
considered here for the failure of any mesh node or data interconnection in the
interconnection network.

Mesh node fault tolerance on the MIC can be achieved using a similar scheme
to the MPP global fault tolerance. That is, have a spare column of mesh nodes
which may be utilized when a faulty mesh node is detected. One way in which a
spare column of mesh nodes may be incorporated within an MIC is illustrated
for a 2x2 mesh MIC in Fig. 9. In the general case with n+1 columns the
configuration is specified by a register (not shown in Fig. 9) having two bits for
each column. A possible organization of a mesh node for this organization is
illustrated in Fig. 10. The two bits from the reconfiguration register are represented
by RL and RR. When RL is set it specifies that the left (W) input to the node
not be connected to the adjacent column node but to the next node to that, i.e. to
skip the node to the left; RR specifies which column node is connected to the (E)
input in a similar way. A bit pattern is loaded into the reconfiguration register
such that one column is skipped. It does not matter what values a disabled faulty
node may have on its interconnection lines since these lines are never used by the
other nodes. For the rest of this paper the configuration in Fig. 10 will be
considered to be implemented by a single 7-way selector.
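
The bit pattern which skips one column can be sketched as below. The encoding follows the description above (RL set means the W input skips the adjacent column, RR set means the E input skips the adjacent column); the function names and the list representation of the reconfiguration register are illustrative assumptions. Connections at the chip edge, which go through the pin selectors, are outside the sketch.

```python
def reconfiguration_bits(num_columns, skipped):
    """Return the (RL, RR) bit lists for num_columns physical columns.

    Only the columns on either side of the skipped column have a bit set.
    """
    rl = [1 if col - 1 == skipped else 0 for col in range(num_columns)]
    rr = [1 if col + 1 == skipped else 0 for col in range(num_columns)]
    return rl, rr


def west_neighbour(col, rl):
    """Column feeding the W input of `col`."""
    return col - 2 if rl[col] else col - 1


def east_neighbour(col, rr):
    """Column feeding the E input of `col`."""
    return col + 2 if rr[col] else col + 1


def active_chain(num_columns, skipped):
    """Walk west to east through the columns that remain in use."""
    _, rr = reconfiguration_bits(num_columns, skipped)
    col = 1 if skipped == 0 else 0
    chain = [col]
    while east_neighbour(col, rr) < num_columns:
        col = east_neighbour(col, rr)
        chain.append(col)
    return chain


if __name__ == "__main__":
    print(reconfiguration_bits(5, 2))
    print(active_chain(5, 2))
```

For five physical columns with column 2 skipped, the walk visits columns 0, 1, 3 and 4, leaving four active columns regardless of the values on the disabled column's lines.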

Any external data interconnection pin may be connected to one of two mesh
nodes; therefore it is necessary to have a 2-way selector associated with each node
as shown in Fig. 9. The control for these selectors is derived from the
reconfiguration register contents.

Fig. 9. MIC organization for fault tolerance to any single mesh node
failure.

Fig. 10. A column reconfigurable mesh node.

Fig. 11. Pin connector and pin selector organization for complete data
line fault tolerance.

The simple organization shown in Fig. 9 can reconfigure for any faulty mesh
node; however, there is no fault tolerance from either a data pin connection failure
or a data pin selector failure. Fault reconfigurability for such failures may be
achieved by adding spare pin connections and selectors, one for each of the four
directions of data connection as shown in Fig. 11. Only the connections to pin
selectors are shown; the interconnections between mesh nodes are similar to Fig. 9.
This organization assumes that the MIC's are themselves connected in a matrix.
Now if any pin selector or connector fails the two remaining pin connections may
be used. The MIC connected to this chip must also use the same data
connections; therefore, we have fault tolerance to any single selector or data connection
failure between the interface of two MIC chips. In the general case, this fault
tolerance requires four extra pin connections and pin selectors. Furthermore,
except for selectors at the end of the rows or columns, column pin selectors are 
way and row pin selectors are ^way. 

To complete the reconfigurable mesh node design the PE's must be connected
to the enabled mesh nodes. One way of doing this is illustrated in Fig. 12. For a
2x2 active mesh there is one extra column of mesh nodes and one extra PE. Since
any mesh node may fail the node-PE selector is associated with the PE rather
than the mesh node, in contrast to the PE-only fault tolerance shown in Fig. 6. In
the general case, each PE must be connected to the mesh nodes by either a 2-way
or a 3-way selector.

Finally, we note that there is a simple extension to this scheme to achieve
reconfiguration for any two faulty mesh nodes. This may be done by having
either two extra columns or one extra row and one extra column. In either case
all the mesh nodes require an additional two data connections. In general a
second spare column will be cheaper than an additional row. For example for an
n x n MIC a spare row requires n+1 mesh nodes whereas a spare column only
requires n mesh nodes.

6.6 Module Fault Tolerance

Fault tolerance to catastrophic chip failure, such as a broken power line or 
command line may be achieved by organizing the total array into a set of 
modules. Each module contains a set of related PE’s and fault tolerance is 
achieved by having a spare module available when one fails. 

For the discussions in this section an example array design will be considered;
however, the techniques discussed here are general in nature and may be applied
to systems with very different design parameters. The example system could be
constructed with present day technology and is for a 1000 x 1000 PE array. The
three PE chip types have the following characteristics: each PE chip contains 16
PE's, each MIC contains a 4x4 matrix of active mesh nodes and there is a memory
chip for every 4 PE's. Fault tolerance will be considered at two module levels: (a)
the chip level and (b) the group level, where each group consists of a set of
chips.

Fig. 12. PE-mesh node interconnections for both PE and mesh node fault
tolerance.

6.7 Chip Level Fault Tolerance

A catastrophic fault could occur in any chip; the PE chip is considered first.
The smallest possible group size consists of the following: one MIC, one PE chip
(plus an extra PE for fault tolerance) and 4 memory chips (plus one bit for fault
tolerance). This group, therefore, involves a 4x4 matrix of active PE's. If the PE
chip fails then it is necessary to replace the whole group.

As an alternative, a larger group may be used involving 256 active PE's in
which the PE chips are distributed between the MIC's. This group consists of 16
MIC's, 17 PE chips and 68 memory chips. Each PE chip contributes one PE to
every MIC in the group; therefore, since each MIC has fault tolerance to one
faulty PE, a single PE chip failure can be handled locally within the group. The
cost of PE chip fault tolerance within the group is more complex inter-chip data
routing.
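
The chip counts for this larger group follow from the example parameters, as the following sketch shows. The function and parameter names are illustrative, and the sketch assumes one spare PE per MIC and one memory chip for every 4 PE's, spare PE's included.

```python
def group_chip_counts(mics=16, nodes_per_mic=16, pes_per_chip=16,
                      pes_per_memory_chip=4, spare_pes_per_mic=1):
    """Chip counts for a group in which PE chips are distributed between MIC's."""
    pes_needed = mics * (nodes_per_mic + spare_pes_per_mic)
    return {
        "active_pes": mics * nodes_per_mic,
        "pe_chips": -(-pes_needed // pes_per_chip),            # ceiling division
        "memory_chips": -(-pes_needed // pes_per_memory_chip),
    }


if __name__ == "__main__":
    print(group_chip_counts())
```

Each of the 16 MIC's needs 17 PE's, giving 272 PE's in all, which is 17 PE chips of 16 PE's and 68 memory chips of 4 PE's each. Since each PE chip supplies exactly one PE to each MIC, the loss of one PE chip removes only one PE from every MIC.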

The total failure of a memory chip would render 4 PE's inoperable in our
example. However, the memory chips may be distributed between MIC's in a
similar way to PE chips in order to achieve group local fault tolerance.

If an MIC fails completely then the whole group is rendered inoperable.
Therefore we need a mechanism to selectively enable a group in the total array.
One approach is to make a group in the form of a column of PE's and have a
spare column of PE's, a similar scheme to the MPP. In our example design, a
group could be organized as a 64x64 matrix of PE's and 16 groups would
constitute one 4 PE-wide column of the PE array; 256 such columns would be required
for the complete PE array.

To allow for the disabling of a column each MIC needs to have a spare set of
data selectors and data pin connections for all data lines in the E/W directions.
This spare set would bypass the adjacent column and link with the spare data
connections on the following column. In this way any single column may be
completely isolated. Since there are 256 active columns it might be advantageous to
have more than one spare column. Then multiple MIC chip failures could be
dealt with as long as they do not occur in adjacent columns.

6.8 Cost of MIC Fault Tolerance

The cost of implementing the fault tolerance schemes with the MIC has been
estimated using three measures: (1) the number of selectors, (2) the number of
internal data lines and (3) the number of data pin connections. The number of
selectors may be considered as a measure of the functional complexity of the chip.
No weight is attached to the complexity of each selector and the small amount of
control logic is not considered. Although the mesh node selectors are more
complex for the fault tolerant design this is balanced by the many additional simpler
selectors which are used for PE's and data pins. The number of internal data
lines is also a measure of chip complexity since they may consume a large
proportion of the chip area. The data pin count gives a good indication of the pin
requirements of the MIC since fewer than 10 additional pins will be required for
control and power.

The costs for various fault tolerant MIC configurations are expressed for an m
x n mesh design in Table 1. The first row in Table 1 is the cost for the simplest
MIC without any fault tolerance. The second row is for single PE fault tolerance
as shown in Fig. 7. The third row is for an MIC with complete single mesh node
and data interconnection fault tolerance as illustrated in Figs. 9-12. The last two
rows include the cost of an extra set of left and right data connections and
selectors for group fault tolerance. The first set of figures is for a spare column within
the MIC and the second set of figures is for a spare row within the MIC. The
spare row concept is slightly cheaper than a spare column for a square mesh, i.e.
when n = m.


Table 1: MIC cost for an n x m active mesh 


Fault Selectors Internal Data Data Pin 

Tolerance Lines Connections 


None 
single PE 
single mesh 
node 
group (a) 
group (b) 


mn 

2mn 

2mn+ 2m+ 3n+ 5 

2mn+ 4m+ 3n+ 7 
2mn+ 5m+ 2n+ 7 


3mn4* m-f- n 
4mn+ m+ n+ 1 
6mn+ 7m+ 7n-l 

6mn+ 17n+ 7n-0 
6mn-H 15m-f 7n-5 


mn+ 2+ 2n 
mn-l- 2m-h 2n+ 1 
mn-f 2m-f 2n-f- 5 

mn+ 4m+ 2n+ 7 
mn+ 4m-f- 2n+ 7 
These costs are shown graphically in Figs. 13-15. In Fig. 13 the number of 
selectors for the MIC is shown. While a significant increase in selectors is needed 
for fault tolerance, the chip is still not very complex. For the example system of 
16 active nodes only 67 selectors are needed for group fault tolerance; when 
n = m = 10, i.e. 100 active nodes, only 277 selectors are required. In Fig. 14 we see 
that there is a large increase in internal data lines when mesh fault tolerance is 
introduced; however, the total number of data lines is still quite reasonable. The 
limiting size design parameter for the MIC is the number of pin connections; in 
Fig. 15 it is shown that there is only a small increase in total pin connections 
when fault tolerance is introduced. For the example 16 node MIC 21 data pins 
are required without fault tolerance and 34 data pins are required for group fault 
tolerance. When there are 100 active nodes on the MIC the data pin requirements 
are 140 without fault tolerance and 167 with group fault tolerance. 
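
The selector and data pin entries of Table 1 can be checked numerically. The 
following Python sketch transcribes those two columns and evaluates them for a 
given mesh size; the dictionary keys and the function name mic_cost are 
illustrative choices and are not part of the original design. 

```python
# Cost formulas transcribed from the Selectors and Data Pin Connections
# columns of Table 1 for an n x m active mesh.  The names used here are
# illustrative assumptions, not part of the original report.
SELECTORS = {
    "none":      lambda n, m: m * n,
    "single PE": lambda n, m: 2 * m * n,
    "mesh node": lambda n, m: 2 * m * n + 2 * m + 3 * n + 5,
    "group (a)": lambda n, m: 2 * m * n + 4 * m + 3 * n + 7,
    "group (b)": lambda n, m: 2 * m * n + 5 * m + 2 * n + 7,
}

DATA_PINS = {
    "none":      lambda n, m: m * n + 2 * m + 2 * n,
    "single PE": lambda n, m: m * n + 2 * m + 2 * n + 1,
    "mesh node": lambda n, m: m * n + 2 * m + 2 * n + 5,
    "group (a)": lambda n, m: m * n + 4 * m + 2 * n + 7,
    "group (b)": lambda n, m: m * n + 4 * m + 2 * n + 7,
}


def mic_cost(table, scheme, n, m):
    """Evaluate one cost formula for an n x m active mesh."""
    return table[scheme](n, m)
```

With these formulas, mic_cost(SELECTORS, "group (a)", 4, 4) evaluates to 67 and 
mic_cost(SELECTORS, "group (a)", 10, 10) to 277, while the data pin formulas at 
n = m = 10 give 140 without fault tolerance and 167 with group fault tolerance, 
in agreement with the figures quoted above. 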

6.9 Fault Detection 

The MPP PE chip for 8 PE's has an additional memory chip for parity 
information. This mechanism provides good fault detection for the local memory 
chips. With the VLSI chip organization proposed here a similar mechanism could 
be implemented. In this case the MIC would monitor all PE local memory reads 
and writes and store the parity in a separate memory chip. For fault tolerance 
each MIC may select between two parity memory chips, and one spare parity 
memory chip for each group would be required. 

An alternative fault detection scheme is to use additional parity bits with 
each data operand. The advantage of such a scheme is that data parity may be 
checked after any data transfers, either I/O or interprocessor, in addition to any 
memory data transfers. For a bit-serial system this could be implemented with 
very little hardware in the PE ALU. A single 1-bit parity register and an 
exclusive-OR gate as shown in Fig. 16(a) is all that is needed for a multibit regis- 
ter ALU. As each operand is read its parity is computed in the T register; then all 
T registers are output to the global OR function which will report any parity 
errors back to the host processor. The T register is selected by the local memory 
address mechanism for setting it to an initial value or reading its contents; there- 
fore, no additional pin connections to the ALU chip are required. This same 
mechanism is used to generate the parity when a result is stored in local memory 
or transferred to another PE. 
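
The bit-serial parity mechanism described above can be modelled in a few lines 
of Python. This is only a behavioural sketch: the function names are 
hypothetical, and even parity is assumed because the text does not state the 
parity sense. Each PE folds the bits of a tagged operand into its T register 
with an exclusive-OR, and the global OR of all T registers flags an error. 

```python
def serial_parity(bits, t=0):
    """Accumulate the parity of a bit-serial operand in a 1-bit T register."""
    for bit in bits:
        t ^= bit & 1  # exclusive-OR gate feeding the T register
    return t


def tag_with_parity(bits):
    """Append a parity bit so the tagged operand has even parity (assumed)."""
    return list(bits) + [serial_parity(bits)]


def global_or(t_registers):
    """Global OR over all PEs: 1 if any T register reports a parity error."""
    return int(any(t_registers))


def check_array(tagged_operands):
    """Read one tagged operand per PE and return the global error flag."""
    return global_or(serial_parity(word) for word in tagged_operands)
```

For example, two operands produced by tag_with_parity give check_array(...) == 0; 
flipping any single bit in either operand makes the result 1. 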



Fig. 13. MIC selector cost for an n x n active mesh; (a) no fault 
         tolerance; (b) PE fault tolerance; (c) mesh node fault 
         tolerance; (d) group fault tolerance. 

Fig. 14. Internal MIC data lines for an n x n active mesh; (a) no fault 
         tolerance; (b) PE fault tolerance; (c) mesh node fault 
         tolerance; (d) group fault tolerance. 

Fig. 15. MIC data pin connections for an n x n active mesh; (a) no fault 
         tolerance; (b) PE fault tolerance; (c) mesh node fault 
         tolerance; (d) group fault tolerance. 

Fig. 16. Bit-serial parity hardware. 
         (a) for a PE with multi-bit registers. 
         (b) for a simple PE with 1-bit registers. 

With simple bit-serial PE ALU's having only 1-bit registers it is necessary to 
interleave the fetching and storing of operand bits. For example, an integer add 
operation requires first the least significant bit of the operands to be read and the 
first bit of the result to be stored; then the next least significant bits are dealt 
with, and so on. To deal with this data flow the parity hardware shown in Fig. 16(b) 
may be used. Three parity registers are used in this case; T1 and T2 compute and 
check the parity of the two operands while T3 computes the parity bit for the 
result. Two extra pin connections are needed to each PE ALU chip to specify 
which operand the current bit on the data bus belongs to. 

An important feature of tagging data with parity bits as described above is 
that, like other bit serial operations, the data format, including the number of 
parity bits, is completely user programmable. The cost of a large amount of parity 
checking is a reduction in effective local storage and an increase in processing 
time. The cost of the parity checking is inversely proportional to the number of 
data bits associated with each parity bit. For 32-bit operands this cost is fairly 
small, i.e. in the order of 3% loss of storage and increase in processing time, 
while for 1-bit logical data this cost may be 100%. The user has the freedom to 
select where and how much parity checking is to be done. 
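
As a rough illustration of this trade-off, the Python sketch below computes the 
relative overhead of one parity bit per operand and the number of operands that 
fit in a local memory. The function names are hypothetical, and the 1024-bit 
memory size in the usage note is taken from the 1 K bits of local PE memory 
mentioned elsewhere in this report. 

```python
def parity_overhead(data_bits, parity_bits=1):
    """Fractional increase in storage and bit-serial processing time."""
    return parity_bits / data_bits


def operands_per_memory(memory_bits, data_bits, parity_bits=0):
    """Number of whole operands (with their parity bits) that fit in memory."""
    return memory_bits // (data_bits + parity_bits)
```

Here parity_overhead(32) is 0.03125 (about 3%) and parity_overhead(1) is 1.0 
(100%). With a 1024-bit local memory, operands_per_memory(1024, 32) is 32 and 
operands_per_memory(1024, 32, 1) is 31. 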

Once an error has been detected, either through data parity or by running 
diagnostic programs for the PE's, the host processor must reconfigure the PE 
array. It must isolate the problem by finding the column and, when possible, the 
node PE which is the source of the error, and generate the bit masks, using the 
PE array when possible, to reconfigure the array. 

6.10 Conclusion 

The problem of fault tolerance in highly parallel mesh connected processors 
has been considered and methods of protecting against the most probable faults in 
the PE array have been proposed. 

Fault tolerance at different levels has been considered. It has been shown 
that fault tolerance to the most error sensitive components, i.e. the functionally 
complex PE ALU's and local memory chips, may be achieved at a low cost at the 
local level. More extensive but less common errors, such as catastrophic chip 
failures and broken command lines including a faulty OR bus line, usually need to 
be dealt with at the more expensive module or column level. 
A high degree of fault tolerance can be achieved with a moderate 
amount of additional hardware. Such hardware may become a very important 
part of VLSI PE arrays having 1,000,000 or more nodes or in applications for 
smaller arrays in situations where high reliability and fault tolerance are necessary. 

Parallel Pascal, an initial high level language for the MPP, has been 
specified with surprisingly few extensions to the base Pascal language. It 
has been implemented on conventional computers via a translator and, for the 
MPP, the front end of the compiler which generates Parallel P-Code has been 
developed. The suitability of the language notation has been demonstrated 
by program examples of typical MPP algorithms. Several useful algorithms have 
been developed for the MPP including fast median filtering and efficient 
bit-level arbitrary function implementation. 

Other high level languages have been specified for the MPP including a 
Parallel Fortran and a Parallel APL. The availability of the Parallel P-Code 
language and a code generator for Parallel Pascal should greatly simplify the 
construction of compilers for these languages; only a front end which compiles 
the source language into Parallel P-Code is needed. If Parallel P-Code is 
used as a common intermediate language then programs written in one language 
will be able to call functions and procedures written in a different language. 

A considerable effort was made to carefully design the Parallel P-Code 
language. This intermediate language is at a higher level than conventional 
P-Code since it must deal with the more complex environment of a parallel 
matrix processor, i.e. a host processor and a PE array. Arrays and record 
data structures are described by descriptors rather than offsets so that the 
selection of the memory system on which they reside may be made by an optimizer 
or code generator. Code generators may be based on Parallel P-Code for many 
other parallel processors in addition to the MPP. The linear format of 
Parallel P-Code is a carry-over from its P-Code compiler origins. The experience 
gained from developing Parallel P-Code suggests that for a future intermediate 
language a parse tree structure format might be more appropriate. This is because 
of the many different data aggregate structures which occur internally in a 
parallel language program. 

Architectural extensions to the MPP PE design have been proposed which 
would greatly enhance the PE's performance. The construction of such PE's 
is feasible with today's VLSI technology; furthermore, much larger, fault 
tolerant PE arrays could now be constructed. The performance of the initial 
MPP for the benchmark image processing tasks will obviously be strictly limited 
by the completely inadequate input and output facility of the host VAX 
computer and disk. If the I/O problem is solved then the next bottleneck 
is the small 1 K bits of local PE memory. However, there is an important 
set of large scale scientific tasks which could be efficiently implemented 
on the MPP (with a larger PE local memory) which are so computation intensive 
that the current I/O speed would not be a problem. Without a larger local 
memory, an MPP with a high speed disk, large staging buffer and efficient 
spooling mechanism may offer an alternative solution for these tasks. 

Pascal was chosen as the most suitable base language; however, it also 
has some problems. The major problems, which were inherited by Parallel 
Pascal, are (a) user defined functions and procedures are constrained by strong 
typing to operate only on a single specified array size and (b) there is no 
separate compilation facility. 

The fixed array size problem is a very frustrating limitation that makes 
the implementation of general purpose library functions very difficult. There 
have been many solutions proposed for this problem, one of the best of which 
is the conformant array schema which has been proposed for the next Pascal 
standard. This feature allows the actual index ranges of an array passed as 
an argument to be determined at run-time. However, the rank (number of 
dimensions) of the arrays is still fixed at compile time. If this becomes a 
standard then it could be incorporated into Parallel Pascal without any 
problems. For efficient compilation a further minor restriction may be that the 
set of all possible subranges to be passed as arguments should be determinable 
at compile time. Other possibilities exist before a Pascal standard is 
established; for example, a preprocessor could be used to make multiple 
variant copies of a procedure which is called with different size arguments. 

The lack of a separate compilation facility means that libraries cannot 
be used in the usual way. Also, large programs take a long time to compile 
since all functions and procedures must be recompiled with every compilation. 
Several proposals have been made for an external function and procedure 
facility for Pascal but none has yet been accepted as a standard. Once a 
standard is developed it should not be difficult to incorporate into Parallel 
Pascal. Until a standard is developed a mechanism (preprocessor) should 
be developed for Parallel Pascal which enables the use of libraries and also 
permits library functions to deal with different sized arrays. It may be 
possible to store the libraries in Parallel P-Code form; then much of the 
recompilation overhead for large programs can be avoided. 

The Parallel Pascal implemented on the MPP will have the initial restriction 
that the last two dimensions of parallel arrays must have 128 elements to 
match the MPP PE array size. The initial implementation of Parallel 
Pascal may have some further restrictions to simplify and speed the 
development of the MPP code generator. More work needs to be done in this area. 

An optimizer should be developed to make Parallel P-Code generated by the 
compiler more efficient for the MPP. Also other design options for the code 
generator should be explored such as the location of program code (MCU or 
VAX) and the prefetching of data from secondary storage. These options may 
be better explored once the initial code generator and the MPP are operational. 

In order to make the MPP a more complete system which is convenient to 
use, a library facility needs to be added to Parallel Pascal, as mentioned 
above, and a library of commonly used, efficiently coded functions needs to 
be developed. Initially this library could be very quickly established with 
maximum flexibility by programming the functions in Parallel Pascal. Then 
key functions should be recoded in assembler language in order to achieve 
the maximum performance of the MPP. Since the MPP will frequently be 
limited by the input/output requirements, the library should include an I/O 
spooling system such as the one described in Section 2. 

APPENDIX A: PARALLEL PASCAL SPECIFICATION 


A.1 Overview 

Parallel Pascal is a high-level language, based upon 
standard Pascal, for parallel matrix processors. The philosophy 
of the standard language was a major factor in the choice of 
extensions. In the following description of Parallel Pascal, 
familiarity with standard Pascal is assumed. [Standard Pascal is 
described in reference 1. A more recent definition is given in 
reference 2.] 


A.2 Declarations 

Each program, procedure, or function block in a Parallel 
Pascal program consists of a (possibly empty) set of declarations 
followed by a set of instructions. The declarations are grouped 
together according to their function: statement label 
definitions, constant definitions, type definitions, and variable 
definitions. 


Parallel Pascal uses the same syntax as standard Pascal for 
these definitions; however, two extensions are provided: subrange 
constants and parallel array types. 

A.2.1 Constant Subranges 

Standard Pascal uses the syntax 

    const 
        identifier = value; 

to associate ''value'' with the named ''identifier''. In 
standard Pascal, ''value'' must be either a literal or a 
(possibly signed) previously-defined constant identifier. 

Parallel Pascal extends the definition of a constant to 
include a constant subrange. Constant subranges are used in 
array indexing (described below). Effectively the definition of 
an identifier as a constant subrange associates two values with 
the identifier - conceptually these represent a consecutive range 
of values. The syntax is: 

    const 
        identifier = low .. high; 

where ''low'' and ''high'' are either literals or (possibly 
signed) previously-defined constant identifiers. As an example: 

    const 
        mpplow = 0; 
        mpphigh = 127; 
        mppidx = mpplow .. mpphigh; 

associates the integers 0 and 127 with the identifier ''mppidx''. 
When used in an array indexing expression, ''mppidx'' represents 
the ordered set of integers (0, 1, 2, ..., 126, 127). 

A.2.2 Parallel Array Types 


Standard Pascal specifies an array type definition with the 
syntax: 

    type 
        newtype = array [ indextype ] of aeltype; 

where ''newtype'' is the name of the new array type, 
''indextype'' is a type expression (either a subrange or a 
scalar type) defining the type of the indices, and ''aeltype'' is 
the type of the array elements. 


On a parallel matrix processor, it is common to store some 
arrays on the non-parallel host machine (or in the scalar control 
unit) and some in the (parallel) hardware array. Parallel Pascal 
provides the reserved word parallel to allow the programmer to 
specify the memory in which an array should reside. A parallel 
array is defined with the syntax: 

    type 
        newtype = parallel array [ indextype ] of aeltype; 


Aside from the memory in which they reside, parallel arrays 
and ''ordinary'' arrays are treated identically in Parallel 
Pascal. The parallel keyword exists only to provide a means for 
the programmer to give the compiler a ''hint'' as to a variable's 
usage. 

A.3 Array Expressions 

The principal difference between standard Pascal and 
Parallel Pascal is that Parallel Pascal permits the specification 
of array expressions. In other words, arrays may be added, 
multiplied, compared, etc. as aggregate quantities rather than 
element-by-element. In order to deal with arrays, and sections 
of arrays, as aggregate units, Parallel Pascal provides 
extensions to standard Pascal's array indexing mechanisms. 

As in standard Pascal, a scalar (non-array) expression may 
be used as an index. Optionally, a subrange constant may be 
added to the scalar expression. The subrange addition is 
specified by the special operator ''@'' to prevent ambiguity when 
a compiler (or human) is parsing the program. The subrange 
constant may either be an identifier defined with a const 
statement (see above) or a literal subrange - two constants 
separated by the symbol ''..''. If the scalar expression is zero 
it may be omitted. As an example, if the array ''m'' is defined 
by: 

    var 
        m: array [1..10] of integer; 
        i: integer; 

then the expression 

        m[i@1..5] 

specifies the following subset of ''m'': 

        m[i+1] m[i+2] m[i+3] m[i+4] m[i+5] 

Finally, if it is desired to select the entire range of an 
index, the index expression may be omitted entirely. Hence, for 
''m'' defined above, either of the expressions 

        m[]   or   m 

will select the entire range of the array. 
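
The indexing rules above can be mimicked for a one-dimensional array in 
Python. This is a minimal sketch rather than the compiler's actual mechanism: 
the function name subrange_index and its parameters are hypothetical, and the 
array is held as a Python list together with its declared lower bound. 

```python
def subrange_index(values, lower_bound, offset=0, subrange=None):
    """Select the elements named by an index of the form offset@low..high.

    values      -- list holding the array elements in index order
    lower_bound -- declared lower index bound of the array (1 for m[1..10])
    offset      -- value of the scalar expression (0 if it is omitted)
    subrange    -- (low, high) pair, or None when the index is omitted
    """
    if subrange is None:
        return list(values)  # m[] or m: the entire range
    low, high = subrange
    start = offset + low - lower_bound
    stop = offset + high - lower_bound + 1
    if start < 0 or stop > len(values):
        raise IndexError("subrange falls outside the declared index range")
    return values[start:stop]
```

For an array m[1..10] holding the values 101 to 110 and i = 2, the call 
subrange_index(m, 1, offset=2, subrange=(1, 5)) returns the five elements 
m[3] to m[7], i.e. 103 to 107. 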

Parallel Pascal also provides a mechanism whereby the 
individual bits of an integer array element can be accessed. 
This mechanism is known as bit indexing. Since the form in which 
numbers are represented varies widely from machine to machine, 
bit indexing is inherently a very non-portable feature; however, 
the availability of this feature may allow the programmer to 
avoid the use of assembly-language code which would be even less 
portable and more difficult to write, debug, and maintain. The 
bit index follows the ''regular'' indices and is preceded by a 
colon: 

        arr[2,3:4]  - select bit 4 of arr[2,3] 

        arr[:0]     - select bit 0 of all elements of ''arr'' 

Bits are numbered from zero, with bit 0 considered the lowest- 
order bit. 
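
On a conventional machine the effect of bit indexing can be imitated with 
shifts and masks. The Python sketch below assumes a plain binary integer 
representation; the helper names bit and bit_plane are illustrative only. 

```python
def bit(value, k):
    """Bit k of an integer, with bit 0 the lowest-order bit."""
    return (value >> k) & 1


def bit_plane(array, k):
    """Bit k of every element of a (possibly nested) list of integers."""
    if isinstance(array, list):
        return [bit_plane(item, k) for item in array]
    return bit(array, k)
```

Here bit(arr[2][3], 4) plays the role of arr[2,3:4], and bit_plane(arr, 0) 
plays the role of arr[:0]. 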

In order to prevent ambiguity, when arrays are used together 
in an expression they must be conformable. (Additionally, the 
array elements must be type compatible, as in standard Pascal.) 
Two arrays are conformable if they have the same rank (number of 
dimensions) and the same shape. Additionally, if the index 
ranges of the arrays are not identical, then the non-matching 
index range(s) of at least one of the arrays must be explicitly 
specified. Table 1 illustrates the conformability of two arrays 
of the same size but with different index ranges. 


Table 1: Conformability Examples 

    a: array [1..5] of integer; 
    b: array [0..4] of integer; 

    a           b           not conformable (implied ranges do not match) 
    a           b[@0..4]    conformable (explicit range for ''b'') 
    a[@1..2]    b           not conformable (shapes do not match) 
    a[@1..5]    b[@0..4]    conformable 
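
The conformability rule can be written as a short check. In the Python sketch 
below each array operand is described by one (low, high, explicit) triple per 
dimension, where explicit records whether the index range was written out in 
the expression; this representation and the function name are assumptions made 
for illustration. 

```python
def conformable(a, b):
    """Return True if two array operands are conformable.

    a, b -- lists of (low, high, explicit) triples, one per dimension.
    """
    if len(a) != len(b):  # same rank
        return False
    for (alo, ahi, aexp), (blo, bhi, bexp) in zip(a, b):
        if ahi - alo != bhi - blo:  # same shape
            return False
        # Non-matching ranges need an explicit range on at least one side.
        if (alo, ahi) != (blo, bhi) and not (aexp or bexp):
            return False
    return True
```

Applied to the four rows of Table 1, the check returns False, True, False and 
True respectively. 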


A.4 Standard Functions and Procedures 

A.4.1 Elemental Functions 

Standard Pascal defines a number of standard functions to 
perform input/output, type conversion (e.g. truncating a real to 
an integer), and to perform common mathematical computations 
(e.g. cosine function). Parallel Pascal considers these 
functions to be ''generic'' in the sense that they may operate 
upon an array of any shape. For these functions, called 
''elemental'' because they treat each element of the array 
independently, the value returned by the function is the same 
shape as the function argument. For example, given the 
definitions: 

    var 
        sine: array [1..10] of real; 
        angle: array [1..10] of real; 

the following computes the sine function for each element of 
''angle'' and stores the result in the corresponding element of 
''sine'': 

        sine := sin(angle); 


Table 2 summarizes the elemental functions. 


Table 2: Elemental Functions 

    syntax        meaning 

                  type conversions 
    trunc(x)      truncate real to integer 
    round(x)      round real to integer 
    ord(x)        ordinal value of x (for scalar types) 
    chr(x)        character with ordinal value x 

                  arithmetic functions 
    abs(x)        absolute value 
    sqr(x)        square (i.e. x^2) 
    sqrt(x)       square root 
    exp(x)        exponential (i.e. e^x) 
    ln(x)         natural logarithm 
    sin(x)        sine function 
    cos(x)        cosine function 
    arctan(x)     arctangent function 

                  miscellaneous 
    odd(x)        boolean: true if x is odd 
    eof(f)        boolean: true if at end-of-file on file f 
    eoln(f)       boolean: true if at end-of-line on file f 
    succ(x)       successor of x (if defined) 
    pred(x)       predecessor of x (if defined) 

A.4.2 Transformational Functions 

In addition to the elemental functions, Parallel Pascal also 
provides some ''transformational'' functions, so named because 
they perform transformations upon the entire array rather than 
element-by-element. Table 3 summarizes the transformational 
functions, which are discussed in more detail below. 


Table 3: Transformational Functions 

    syntax                            meaning 

    shift(array, S1, S2, ..., Sn)     end-off shift data within array 
    rotate(array, S1, S2, ..., Sn)    circularly rotate data within array 
    expand(array, dim, size)          expand array along specified dimension 
    transpose(array, D1, D2)          transpose two dimensions of array 
    sum(array, D1, D2, ..., Dn)       reduce array with arithmetic sum 
    prod(array, D1, D2, ..., Dn)      reduce array with arithmetic product 
    all(array, D1, D2, ..., Dn)       reduce array with Boolean AND 
    any(array, D1, D2, ..., Dn)       reduce array with Boolean OR 
    max(array, D1, D2, ..., Dn)       reduce array with arithmetic maximum 
    min(array, D1, D2, ..., Dn)       reduce array with arithmetic minimum 

The functions ''shift'' and ''rotate'' are used to move data 
within an array. These two functions have the same syntax; they 
differ in that ''shift'' performs an end-off shift of the array 
(with zeros shifted in at the other end) whereas ''rotate'' 
performs a circular rotation along the specified dimensions. The 
function call specifies the array to be operated upon and the 
amount that each dimension is to be moved. As an example, given 
the definition 

    var 
        a, b: array [0..127, 0..127] of integer; 

the statement 

        a := shift(b, 0, 3); 

is functionally equivalent to (but much faster than): 

    for i := 0 to 127 do 
    begin 
        for j := 0 to 124 do 
            a[i, j] := b[i, j+3]; 
        for j := 125 to 127 do 
            a[i, j] := 0; 
    end; 
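
For experimentation on a conventional machine, the two functions can be 
sketched in Python for two-dimensional arrays held as lists of rows. The shift 
direction follows the equivalent loop given above; applying the same direction 
convention to ''rotate'' is an assumption, since the text does not spell it 
out. 

```python
def shift(b, s1, s2, fill=0):
    """End-off shift of a 2-D array: result[i][j] = b[i+s1][j+s2] or fill."""
    rows, cols = len(b), len(b[0])
    return [[b[i + s1][j + s2]
             if 0 <= i + s1 < rows and 0 <= j + s2 < cols else fill
             for j in range(cols)]
            for i in range(rows)]


def rotate(b, s1, s2):
    """Circular rotation of a 2-D array by s1 rows and s2 columns."""
    rows, cols = len(b), len(b[0])
    return [[b[(i + s1) % rows][(j + s2) % cols]
             for j in range(cols)]
            for i in range(rows)]
```

For b = [[1, 2, 3, 4], [5, 6, 7, 8]], shift(b, 0, 1) gives 
[[2, 3, 4, 0], [6, 7, 8, 0]] and rotate(b, 0, 1) gives 
[[2, 3, 4, 1], [6, 7, 8, 5]]. 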


The ''transpose'' function is used to transpose an array 
about two specified dimensions. If only one dimension is 
specified, the array is ''flipped'' about that dimension. In 
order to determine the shape of the result at compile-time, the 
dimensions about which the transposition is to take place must 
be specified by compile-time constants. 

The shape of an array may be altered by the ''expand'' 
function. The arguments to this function are the array to be 
operated upon, the new dimension along which the expansion is to 
take place, and a type specification. The array is expanded 
along the indicated dimension. If the rank of the array is n, 
then the second argument to ''expand'' can be at most n+1. The 
dimension along which the expansion is to take place must be a 
compile-time constant, in order to ensure that the shape of the 
result can be determined at compile-time. 

The functions ''sum'', ''prod'', ''all'', ''any'', ''max'', 
and ''min'' are used to reduce an array along a specified set of 
dimensions. These functions differ only in the reduction 
operation that they perform. The arguments to a reduction 
function are the array to be operated upon and a list of 
dimensions over which the reduction is to be performed. In order 
to ensure that the compiler can determine the shape of the 
result, the dimensions must be compile-time constants. The first 
dimension of an array is numbered 1 (not 0). As an example, 
given a two-dimensional array ''a'', 

        sum(a, 1, 2) 

computes the arithmetic sum of all of the elements in ''a'', 
while 

        max(a, 2) 

produces a vector consisting of the maximum element in each row 
of ''a''. 
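
The reduction functions can be illustrated for the two-dimensional case in 
Python. The helper reduce2d below is hypothetical and handles only arrays of 
rank two; dimensions are numbered from 1 as in the text, so reducing over 
dimension 2 collapses each row to a single value. 

```python
from functools import reduce as fold
import operator


def reduce2d(op, a, *dims):
    """Reduce a 2-D array (list of rows) over the listed dimensions."""
    selected = set(dims)
    if selected == {1, 2}:
        return fold(op, (x for row in a for x in row))  # scalar result
    if selected == {2}:
        return [fold(op, row) for row in a]             # one value per row
    if selected == {1}:
        return [fold(op, col) for col in zip(*a)]       # one value per column
    raise ValueError("dims must be drawn from 1 and 2")
```

With a = [[1, 5, 2], [7, 3, 4]], reduce2d(operator.add, a, 1, 2) corresponds to 
sum(a, 1, 2) and gives 22, while reduce2d(max, a, 2) corresponds to max(a, 2) 
and gives [5, 7]. 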


A.4.3 Standard Procedures 

Like standard Pascal, Parallel Pascal also provides a set of 
standard procedures for file handling, dynamic memory allocation, 
and data transfer. Table 4 summarizes the available standard 
procedures. 

Table 4: Standard Procedures 

    syntax                  meaning 

                            file handling procedures 
    put(f)                  append the buffer variable to file f 
    get(f)                  get a new buffer variable from file f 
    reset(f)                reset file f for reading (rewind) 
    rewrite(f)              prepare file f for writing 

                            dynamic memory allocation 
    new(p)                  allocate storage, place address in p 
    new(p, t1, ..., tn)     as above, but fix record variants 
    dispose(p)              release storage described by p 

                            data transfer procedures 
    pack(a, i, z)           pack i elements of a into z 
    unpack(z, a, i)         unpack i elements of z into a 


A.5 Control Flow 

In addition to the standard Pascal control structures (if, 
case, while, repeat-until, goto), Parallel Pascal provides the 
where statement for conditional assignment to arrays according to 
a controlling expression. The syntax is 

    where array-expression do 
        statement 
    otherwise 
        statement 

where the otherwise and the second controlled statement may be 
omitted. 

The execution of a where is defined as follows. First, the 
controlling expression is evaluated to obtain a Boolean array 
(mask array). Next, the first controlled statement is evaluated. 
Array assignments are masked according to the mask array computed 
above. Finally, if there is a second controlled statement, it is 
evaluated. Array assignments within the second controlled 
statement are masked by the inverse of the mask array. 

where statements may be nested, provided that all of the 
controlling array expressions are conformable and type 
compatible. The effect of a where statement is local to the 
procedure or function in which it appears - it does not affect 
the execution of any procedures or functions called from one of 
the controlled statements. 
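
The masking behaviour of where can be modelled on a conventional machine as an 
elementwise masked assignment. The Python sketch below handles a single 
one-dimensional assignment in each branch; the function name where_assign and 
its parameters are hypothetical and do not come from the language definition. 

```python
def where_assign(mask, target, then_values, else_values=None):
    """Model of: where mask do target := then_values
                 otherwise   target := else_values

    All arguments are lists of the same length (conformable arrays).
    Elements of target are replaced only where the mask permits it.
    """
    result = list(target)
    for k, selected in enumerate(mask):
        if selected:
            result[k] = then_values[k]       # masked by the mask array
        elif else_values is not None:
            result[k] = else_values[k]       # masked by the inverse mask
    return result
```

For a = [3, -1, 4, -5] and the mask [x < 0 for x in a], the call 
where_assign(mask, a, [0] * 4) clears the negative elements and leaves the 
others unchanged, giving [3, 0, 4, 0]; supplying [2 * x for x in a] as the 
otherwise values gives [6, 0, 8, 0]. 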
A.6 Pascal Grammar 

The metalanguage for this grammar is as follows: 

1. The left-hand-side of each production is separated from its 
   right-hand-side by the symbol ::= . 

2. Nonterminal names are represented directly. 

3. Literal symbols are underlined. In cases where confusion 
   with metasymbols is possible, literals are enclosed in 
   double-quote marks ". 

4. The vertical bar | represents a choice between alternatives. 

5. Parentheses () enclose a selection of constructions which 
   are separated by vertical lines. 

6. Square brackets [] enclose a construction or choice of 
   constructions which may occur zero or one times in the 
   production. 

7. Curved brackets {} enclose a construction or choice of 
   constructions which may occur any number of times. 

letter ::= a | b | ... | z 

digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 

special-symbol ::= and | array | begin | case | const | div | do | downto | 
        else | end | file | for | function | goto | if | in | label | 
        mod | nil | not | of | packed | parallel | procedure | program | 
        record | repeat | set | then | to | type | until | var | while | 
        with 

identifier ::= letter { letter | digit } 

directive ::= letter { letter | digit } 

digit-sequence ::= digit { digit } 

unsigned-integer ::= digit-sequence 

unsigned-real ::= 
        unsigned-integer . digit-sequence [ e scale-factor ] | 
        unsigned-integer e scale-factor 

unsigned-number ::= unsigned-integer | unsigned-real 

scale-factor ::= signed-integer 

sign ::= + | - 

signed-integer ::= [ sign ] unsigned-integer 

signed-real ::= [ sign ] unsigned-real 

signed-number ::= signed-integer | signed-real 

label ::= digit-sequence 

character-string ::= ' string-element { string-element } ' 

string-element ::= apostrophe-image | string-character 

apostrophe-image ::= '' 

string-character ::= one-of-an-implementation-defined-set-of-characters 

block ::= label-declaration-part 
        constant-definition-part 
        type-definition-part 
        variable-declaration-part 
        procedure-and-function-declaration-part 
        statement-part 

label-declaration-part ::= [ label label { , label } ; ] 

constant-definition-part ::= [ const constant-definition ; 
        { constant-definition ; } ] 

type-definition-part ::= [ type type-definition ; 
        { type-definition ; } ] 

variable-declaration-part ::= [ var variable-declaration ; 
        { variable-declaration ; } ] 

procedure-and-function-declaration-part ::= 
        { ( procedure-declaration | function-declaration ) ; } 

statement-part ::= compound-statement 

constant-definition ::= identifier = constant 

constant ::= subrange-constant | scalar-constant 

subrange-constant ::= scalar-constant .. scalar-constant | 
        subrange-constant-identifier 

scalar-constant ::= [ sign ] ( unsigned-number | scalar-constant-identifier ) | 
        character-string 

scalar-constant-identifier ::= constant-identifier 

subrange-constant-identifier ::= constant-identifier 

constant-identifier ::= identifier 

type-definition ::= identifier = type-denoter 

type-denoter ::= type-identifier | new-type 

new-type ::= simple-type | structured-type | pointer-type 

simple-type-identifier ::= type-identifier 

structured-type-identifier ::= type-identifier 

pointer-type-identifier ::= type-identifier 

type-identifier ::= identifier 

simple-type ::= ordinal-type | real-type 

ordinal-type ::= enumerated-type | subrange-type | integer-type | 
        Boolean-type | char-type | ordinal-type-identifier 

enumerated-type ::= "(" identifier-list ")" 

identifier-list ::= identifier { , identifier } 

subrange-type ::= scalar-constant .. scalar-constant 

structured-type ::= [ packed ] unpacked-structured-type | 
        structured-type-identifier 

unpacked-structured-type ::= array-type | record-type | 
        set-type | file-type 

array-type ::= [ parallel ] array "[" index-type { , index-type } "]" 
        of component-type 

index-type ::= ordinal-type 

component-type ::= type-denoter 

record-type ::= record [ field-list [ ; ] ] end 

field-list ::= fixed-part [ ; variant-part ] | variant-part 

fixed-part ::= record-section { ; record-section } 

variant-part ::= case variant-selector of variant { ; variant } 

variant-selector ::= [ tag-field : ] tag-type 

tag-field ::= identifier 

variant ::= case-constant-list : "(" [ field-list [ ; ] ] ")" 

tag-type ::= ordinal-type-identifier 

case-constant-list ::= case-constant { , case-constant } 

case-constant ::= constant 

set-type ::= set of base-type 

base-type ::= ordinal-type 

file-type ::= file of component-type 

pointer-type ::= ^ domain-type | pointer-type-identifier 

domain-type ::= type-identifier 

variable-declaration ::= identifier-list : type-denoter 

variable-access ::= entire-variable | component-variable | 
        referenced-variable | buffer-variable 

entire-variable ::= variable-identifier 

variable-identifier ::= identifier 

component-variable ::= indexed-variable | field-designator 

indexed-variable ::= array-variable "[" 
        [ index-expression { , index-expression } ] [ : bit-specifier ] "]" 

index-expression ::= [ expression ] [ @ subrange-constant ] 

array-variable ::= variable-access 

bit-specifier ::= simple-expression 

field-designator ::= record-variable . field-identifier 

record-variable ::= variable-access 

field-identifier ::= identifier 

buffer-variable ::= file-variable ^ 

file-variable ::= variable-access 

referenced-variable ::= pointer-variable ^ 

pointer-variable ::= variable-access 

procedure-declaration ::= 
        procedure-heading ; directive | 
        procedure-identification ; procedure-block | 
        procedure-heading ; procedure-block 

procedure-heading ::= procedure identifier [ formal-parameter-list ] 

procedure-identification ::= procedure procedure-identifier 

procedure-identifier ::= identifier 

procedure-block ::= block 

function-declaration ::= 
        function-heading ; directive | 
        function-identification ; function-block | 
        function-heading ; function-block 

function-heading ::= 
        function identifier [ [ formal-parameter-list ] : result-type ] 

function-identification ::= function function-identifier 

function-identifier ::= identifier 

result-type ::= type-identifier 

function-block ::= block 

formal-parameter-list ::= 
        "(" formal-parameter-section { ; formal-parameter-section } ")" 

formal-parameter-section ::= 
        value-parameter-specification | 
        variable-parameter-specification | 
        procedural-parameter-specification | 
        functional-parameter-specification 

value-parameter-specification ::= identifier-list : type-identifier 

variable-parameter-specification ::= var identifier-list : type-identifier 

bound-identifier ::= identifier 

procedural-parameter-specification ::= procedure-heading 

functional-parameter-specification ::= function-heading 

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unsigned-constant unsigned-number | character-string | 

constant-identifier | nil 

factor variable-access | unsigned-constant | bound-identifier | 

function-designator I set-constructor | 

"(" expression 1 not factor 

set-constructor "[" [ member-designator { • member-designator )1 " 

member-designator expression [ •• expression 1 

term factor { multlplying-operator factor } 

simple-expression [ sign ] term { adding-operator term ) 

expression : : • 

simple-expression [ relational-operator simple-expression ] 

multlplying-operator * 1 / 1 div 1 mod | and 

adding-operator : : • -f 1 - | o£ 

relational-operator ■ 1 <> I < 1 > I <■ I >• 1 ^ 

function-designator function-identifier [ actual-parameter-list 1 

actual-parameter-list "(" actual-parameter { » actual-parameter ) 

actual-parameter : : expression | variable-access I 

procedure-identifier | function-identifier 

statement [ label : ] ( simple-statement | structured-statement ) 

simple-statement empty-statement | assignment-statement | 

procedure-statement | go to-s tatement 
empty-statement 

assignment-statement 

( variable-access | function-identifier ) expression 
procedure-statement procedure-identifier [ actual-parameter-list 1 

goto-statement goto label 

structured-statement compound-statement | conditional-statement I 

repetitive-statement | with-statement 

compound-statement begin statement-sequence end 

statement-sequence statement { ; statement } 

conditional-statement if-statement | case-statement I 

whe re-statement 

tf-statement Booluan-expresslon then statement [ elso-patt 1 

else-part else statement 

case-statement case case-index o_£^ case-list-element 

{ ; case-lis t- element } [ ; 1 end 

case-list-element : : <■ case-constant-list : statement 
case-index : : > expression 
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where^e tatement where parallel-Booleaifexpresslon ^ 

statement [ otherwise~pact ] 
otherwlse-part otherwise statement 

parallel~Boolean''expres6lon expression 

repetitive-statement repeat-statement | while-statement I 

for-statement 

repeat-statement repeat statement-sequence until Boolean-expression 

while-statement while Boolean^expresslc i ^ statement 

for-statement for control-variable :■ initial-value 

( £0 I downto~ l final-value ^ statement 
control-var Ible entire-variable 

1 nl t la 1-value : : expression 

final-value : : expression 

wlth-8 tatement with record-var lable-li s t ^ statement 

record-variable-list record-variable { , record-variable } 

raad-parameter-lls t "(” [ file-variable , 1 variable-access 

{ , variable-access } ")" 1 

readln-parameter-llst [ '*(" ( file-variable | variable-access ) 

{ , variable-access } ”)*' 1 

wrlte-parameter-list "(" I f lle-var lable , ] variable-parameter 

{ , write-parameter } ”)" 

write-parameter expression [ : expression [ : expression 1 ] 

writeln-parameter-list [ ”(*' ( file-variable 1 write-parameter ) 

{ > write-parameter } ")" 1 

program : : ■> program-heading ; program-block . 

program-heading program Identifier [ program-parameters ")" 1 

program-parameters Identifier-list 

program-block block 
A.7 Parallel Pascal Error Codes


In addition to the standard Pascal error codes (defined in 
reference 1), the following error codes are defined for Parallel 
Pascal : 


330: must be parallel array type
331: illegal type for parallel array
332: boolean type required
360: arrays not compatible
361: array not compatible with controlling array
362: result must be array type
363: parallel array not allowed
364: function result type must be array
365: dimension not compatible with array
366: integer constant expected
367: at least one dimension expected
368: bit index type must be integer
369: error in number of standard function arguments
370: subrange exceeds array index limits
371: set type not compatible with array index type
373: bit indexing not allowed
374: illegal array type for bit indexing
375: subrange constant expected
397: unimplemented feature
398: implementation restriction
399: implementation restriction
400: internal inconsistency
APPENDIX B: PARALLEL P-CODE SPECIFICATION

B.1 Data Declarations

To permit efficient handling of arrays, both parallel and
ordinary, it is necessary that the code generator be supplied
with information about the size and shape of data items. Thus,
the intermediate language must include specifications for the
fundamental data types (arrays and records).

The code generator's view of the world is based upon the 
following assumptions: 

1. The code generator ''knows'' whether the code it is
generating is to reside on the host (for the MPP this would
be the VAX) or the sequential control unit.

2. A few standard types are predefined:

        integer
        real
        Boolean
        char
        scalar pointer
        array (i.e. parallel) pointer
        file

All other types that the code generator must deal with are
defined in various ''pseudo-op'' statements in the
intermediate language.

3. The size and shape of arrays and the layout of records are
known to the code generator.

The compiler specifies the target machine on which code is
to be generated via the .SITE pseudo-operator. The syntax is:

        .SITE sitename

where sitename may be either ''HOST'' to specify the host
processor (for the MPP this is the VAX-11/780) or ''MCU'' to
specify the (main) control unit of the parallel processor.

The following pseudo-operators are used to define derived
types:

.ARRAY This pseudo-operator is used to specify the size and
shape of an array type. The syntax is:

        .ARRAY newname,basetype,rank,dim0low,dim0high,...

where newname is the name of the type which is being
defined, basetype is the name of a previously defined
type, rank is the total number of dimensions of the
array, and dimilow and dimihigh are the low and high bounds
of each index range. rank is negative if the array is
a parallel array. For instance, the type definitions:

        type
          parr = parallel array [1..128,1..128] of integer;
          arr = array [5..10] of parr;

would be translated to

        .ARRAY parr,integer,-2,1,128,1,128
        .ARRAY arr,parr,1,5,10
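
The encoding of the rank sign can be illustrated with a small sketch. It is written in Python purely for illustration; the helper name array_pseudo_op and its argument layout are assumptions and are not part of the compiler described here.

```python
def array_pseudo_op(newname, basetype, bounds, parallel=False):
    """Format one .ARRAY pseudo-operator line (illustrative helper).

    bounds is a list of (low, high) pairs, one per dimension.
    The rank field is negated for parallel arrays, as the text states.
    """
    rank = len(bounds)
    if rank == 0:
        raise ValueError("an array needs at least one dimension")
    if parallel:
        rank = -rank
    fields = [newname, basetype, str(rank)]
    for low, high in bounds:
        if low > high:
            raise ValueError("low bound exceeds high bound")
        fields += [str(low), str(high)]
    return ".ARRAY " + ",".join(fields)


# The two type definitions used as the example in the text.
parr_line = array_pseudo_op("parr", "integer", [(1, 128), (1, 128)], parallel=True)
arr_line = array_pseudo_op("arr", "parr", [(5, 10)])
```

Under these assumptions the two calls reproduce the translated statements given above.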

.RANGE This pseudo-operator is used to define a subrange. The
syntax is:

        .RANGE newname,low,high

For example, the definition

        type
          xxx = 10..32;

would be translated to

        .RANGE xxx,10,32

.RECORD This pseudo-operator is used to define a record. The
code generator must know the configuration of a record,
hence it is necessary to provide the type of each
component. The code generator is responsible for
computing the appropriate offsets. The syntax is:

        .RECORD recname,cmpname,offset,type

where recname is the name of the record type, cmpname
is the name of the current component, offset is the
offset (see below) and type is the type of the component.
There is one recname field per record and one cmpname
for each component. offset is normally ''nil'',
indicating that the code generator should choose the
offset (normally as the next component in the record
being defined). If offset is not ''nil'' it specifies
a record component; in this case, the current record
component is to have the same offset as the named
component. This is used to align variant records. A
.RECORD statement implicitly defines two types: the
record itself and the record component. As an example,
the type definition:

        rec = record
                x: integer;
                y: real;
                case Boolean of
                  false: (zf: integer);
                  true: (zt: real);
              end;

could be translated as follows:

        .RECORD rec,x,nil,integer
        .RECORD rec,y,nil,real
        .RECORD rec,zf,nil,integer
        .RECORD rec,zt,zf,real

(The last definition specifies that the component
''zt'' is to be aligned with the component ''zf''.)
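
As a rough illustration of how a code generator might act on the offset field, the following Python sketch assigns offsets to the components of the example record. The function name layout_record and the storage sizes (one unit for integer, two for real) are hypothetical; the report leaves the actual layout to the code generator.

```python
def layout_record(statements, sizes):
    """Assign offsets for (recname, cmpname, offset, type) tuples.

    offset is "nil" (place after everything laid out so far) or the
    name of an earlier component whose offset is to be shared.
    sizes maps type names to storage units; the values are assumed.
    """
    offsets = {}
    next_free = 0
    for recname, cmpname, offset, typename in statements:
        if offset == "nil":
            start = next_free
        else:
            start = offsets[offset]      # align with the named component
        offsets[cmpname] = start
        next_free = max(next_free, start + sizes[typename])
    return offsets, next_free


rec = [
    ("rec", "x", "nil", "integer"),
    ("rec", "y", "nil", "real"),
    ("rec", "zf", "nil", "integer"),
    ("rec", "zt", "zf", "real"),
]
offsets, total = layout_record(rec, {"integer": 1, "real": 2})
```

With the assumed sizes, zf and zt receive the same offset and the record length is set by the larger of the two variants.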


.SET This pseudo-operator is used to specify the size of a
power set. The syntax is:

        .SET newname,low,high

where newname is the name of the type which is being
defined, low is the lowest element (integer) and high
is the highest element. Sets of type char are
converted by the compiler to the appropriate integer
type.

.FILE This pseudo-operator is used to specify a file. The
syntax is:

        .FILE newname,ftype

newname is defined to be a file of ftype.

.POINT This pseudo-operator is used to define pointer types.
For the most part, pointers are considered to be the
same thing as integers; however, occasionally it is
necessary to distinguish them. The syntax is:

        .POINT newname,ptype

which causes newname to represent a pointer to type
ptype.

.TYPE This pseudo-operator equivalences an existing type-name
with a new name. The syntax is:

        .TYPE newname,oldname

newname is defined to be the same type as oldname.
This redundant statement allows some simplification in
the front-end of the Parallel Pascal compiler.

The intermediate language representation of arrays consists
of two portions. The first is the (static) logical type
information specified by the .ARRAY pseudo-operator. The second
is the (dynamic) information about the physical storage
allocation which is required at runtime. If set and vector
indexing are excluded, then Parallel Pascal permits any
contiguous subset of array elements to be operated upon at once.
The intermediate language manipulates arrays through a conceptual
entity called an array descriptor. Instead of pushing the actual
elements of an array onto the stack, the descriptor for that
array is pushed instead. The array descriptor specifies the base
address and the storage mapping defined by the index ranges of
each array dimension. The compiler does not know or care what
format the array descriptor has. The LLA instruction is used to
load a ''blank'' descriptor onto the stack (a descriptor
specifying the address but no index ranges); a sequence of
indexing instructions (IX0, IX1, IX2) is performed to ''fill in''
this information.

Records are similarly defined by a record descriptor. Like
array descriptors, record descriptors consist of the (static)
information provided by the .RECORD pseudo-operator and the
(dynamic) information contained on the runtime stack. The
dynamic information specifies the address of the record and the
fields of the record which have been selected to participate in a
future operation (e.g. load, add, store). The compiler does not
know or care about the format of this information. A record
descriptor is constructed by performing an LLA (which loads a
descriptor for the entire record) followed by one or more SEL
instructions to select successively-nested fields.

Records may contain arrays and array elements may be
records. The appropriate combination of IX? and SEL
instructions is used (recursively) to select a set of array
elements within a record and a field within a set of nested
records.

Most of the intermediate language operators require the
specification of a type. The following code segment illustrates
how the (static) type and (dynamic) array descriptor are used:

        type
          arr1 = array [1..5] of integer;
          arr2 = array [2..6] of integer;
        var
          a: arr1;
          b: arr2;
        begin
          a := b[1@1..5];
        end.

        .ARRAY arr1,integer,1,1,5
        .ARRAY arr2,integer,1,2,6

        LLA <address of a>
        IX0 arr1
        LLA <address of b>
        LDC integer,2
        LDC integer,6
        IX2 arr2
        LDI arr2
        STO arr1
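
The way a ''blank'' descriptor is filled in by the indexing instructions can be sketched as follows. This is a simplified Python model under stated assumptions: the class ArrayDescriptor, its fields, and the bounds checks are illustrative only, since the report deliberately leaves the descriptor format to the code generator. IX1 is omitted for brevity.

```python
class ArrayDescriptor:
    """Hypothetical model of an array descriptor on the runtime stack."""

    def __init__(self, name, bounds):
        self.name = name          # stands in for the base address
        self.bounds = bounds      # declared (low, high) per dimension
        self.selected = []        # index ranges filled in so far

    def _next_dim(self):
        if len(self.selected) >= len(self.bounds):
            raise IndexError("all index ranges already specified")
        return self.bounds[len(self.selected)]

    def ix0(self):
        # IX0: take the whole declared range of the next dimension.
        self.selected.append(self._next_dim())

    def ix2(self, low, high):
        # IX2: take an explicit low..high range of the next dimension.
        dlow, dhigh = self._next_dim()
        if not (dlow <= low <= high <= dhigh):
            raise IndexError("subrange exceeds array index limits")
        self.selected.append((low, high))

    def shape(self):
        return [high - low + 1 for low, high in self.selected]


# Replay of the descriptor-building part of the sequence shown above.
stack = []
a = ArrayDescriptor("a", [(1, 5)])
b = ArrayDescriptor("b", [(2, 6)])
stack.append(a)                 # LLA <address of a>
stack[-1].ix0()                 # IX0 arr1
stack.append(b)                 # LLA <address of b>
stack.append(2)                 # LDC integer,2
stack.append(6)                 # LDC integer,6
high = stack.pop()
low = stack.pop()
stack[-1].ix2(low, high)        # IX2 arr2
```

In this model both descriptors end up describing five elements, which is the condition needed for the subsequent LDI and STO to be legal.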



An example with records and record descriptors (the ''xxx''
and ''yyy'' are arbitrarily-chosen names):

        type
          arr10 = array [1..10] of real;
          recrd = record
                    x, y: arr10;
                  end;
          arrrec = array [1..5] of recrd;
        var
          v: arrrec;

          v.x[5] := 0;

        .ARRAY arr10,real,1,1,10
        .RECORD recrd,x,nil,arr10
        .RECORD recrd,y,nil,arr10
        .ARRAY arrrec,recrd,1,1,5

        LLA <address of v>
        IX0 arrrec
        .ARRAY xxx,real,1,1,5
        SEL arrrec,x,xxx
        LDC integer,5
        .POINT yyy,real
        IX1 xxx,yyy
        LDC integer,0
        CVT integer,real
        STO real
B.2 Procedure/Function Arguments and Local Variables


The following pseudo-operators are used to define subroutine
arguments and local variables:

.ENTRY This pseudo-operator indicates that a new block is
being entered.

.EXIT This pseudo-operator complements .ENTRY by indicating
the end of a block. Definitions in the current block
are to be ''forgotten'' by the code generator at this
point.

.ARG This pseudo-operator defines an argument to the current
subroutine. The syntax is:

        .ARG num,type,rv

where num is an integer which starts at zero (see
comment below) and is incremented by one for each
argument, type is the type of the argument, and rv is
either zero or non-zero to indicate that the data is
being passed by value or reference, respectively. The
num field normally is a positive integer. If the
subroutine returns a value (i.e. if it is a function
rather than a procedure) the space for the result is
reserved by an .ARG pseudo-operator with zero in the
num field.

.LOCAL This pseudo-operator defines local variables. The
syntax is:

        .LOCAL num,type,equ

where num and type are defined as for .ARG. equ is
used to indicate storage sharing to the code generator.
If equ is zero, the next available memory location
should be allocated. If equ is non-zero it specifies a
previously-defined local variable; in this case the
storage for the new local variable is to be allocated
on top of the previously-defined variable. The numbers
assigned to local variables belong to the same space as
those assigned to subroutine arguments. Thus, if there
are n arguments to a subroutine (and hence n .ARG
statements), the num field for the first .LOCAL
statement will contain n+1.
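
The shared numbering of arguments and local variables can be sketched in Python as below. The function number_slots and its tuple output are assumptions made for illustration; only the numbering rule itself (zero for a function result, 1 to n for arguments, n+1 onward for locals) is taken from the text.

```python
def number_slots(arg_types, local_types, result_type=None):
    """Return (kind, num, type) tuples following the numbering rule.

    A function result, if any, occupies slot 0; arguments are numbered
    from 1; locals continue in the same number space after the arguments.
    """
    slots = []
    if result_type is not None:
        slots.append(("ARG", 0, result_type))     # space for the result
    for num, typename in enumerate(arg_types, start=1):
        slots.append(("ARG", num, typename))
    first_local = len(arg_types) + 1
    for num, typename in enumerate(local_types, start=first_local):
        slots.append(("LOCAL", num, typename))
    return slots


# A function with two arguments and one local variable.
slots = number_slots(["integer", "real"], ["integer"], result_type="real")
```

Here the single local variable receives the number 3, one more than the number of arguments.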
B.3 Parallel P-Code Mnemonics

The mnemonics and their functions for the opcodes defined in
Parallel P-Code are as follows:

ABS Produce absolute value. The syntax is: 

ABS type 

ADD Add two operands. The syntax is: 

ADD type 

AND Perform Boolean ''and''. This is only defined for Boolean 
variables; however, an array may be specified so a type is 
required. The syntax is: 

AND type 

CHK Check that top of stack is between two specified values. 

CSP Call standard procedure. The syntax is:

CSP procedurename,argtype,resulttype

where ''argtype'' is the type of the primary argument and
''resulttype'' is the type of the function result. (If the
called routine is a standard procedure the literal string
''nil'' is used.) Calls of standard procedures and
functions are discussed in more detail below.

CUP Call user procedure. The syntax is:

CUP level,procedurename,resulttype

Calls of user procedures and functions are discussed in
more detail below.

CVT Convert the top of stack from one type to another.
Conversions performed by this operator may alter both the
shape (array dimensions) and the underlying type of the
object. The syntax is:

CVT oldtype,newtype

CVN Convert the next-to-top of stack from one type to another.
This is similar to CVT, defined above. The syntax is:

CVN oldtype,newtype,tstype

where tstype is the type of the item on top of the stack (this
information is required in order to locate the next-to-top
element, since descriptors vary in size).

DEC Decrement top of stack by a specified amount. This may
only be applied to integers or subranges or arrays of type
integer or subrange. The syntax is:

DEC type,amount

DIF Evaluate set difference. The syntax is:

DIF type

DIV Perform real division. The syntax is:

DIV type

Notice that, unlike Pascal P4, integer operands must be
explicitly converted to real format before the division.

DUP Duplicate top of expression stack. The syntax is:

DUP type

DVI Perform integer division. This may only be applied to
integers or subranges or arrays of type integer or
subrange. The syntax is:

DVI type

ENW This operator is used to remove the effect of a mask (''end
where''). The syntax is:

ENW type

where type is the type of the mask (located on top of the
expression stack). The mask stack is ''popped''; the
previous mask (that is, the mask in effect before the most
recent WHR) is restored. This operation is illegal if
there is no current mask.

EOF Test for end-of-file condition. There are no arguments;
the filename is assumed to be the top item on the stack.

EQU Test for equality. The syntax is:

EQU type

FJP Jump if item on top of stack is false. The item must be a
scalar Boolean quantity. The syntax is

GEQ Test for greater-than or equal-to. The syntax is:

GEQ type

GRT Test for greater-than. The syntax is:

GRT type

INC Increment top of stack by a specified amount. This may only
be applied to integers or subranges or arrays of type
integer or subrange. The syntax is:

INC type,amount

INN Test for set membership. The syntax is:

INN settype

INT Perform set intersection. The syntax is:

INT settype

IOR Perform Boolean inclusive or. This is only defined for
Boolean variables; however, an array may be specified so a
type is required. The syntax is:

IOR type

IX0 Index with zero values. This operator is used in the
construction of array descriptors. It ''fills in'' the
index specification for the first unspecified index range
in the array descriptor on the top of the runtime stack.
The syntax is:

IX0 type

IX1 Index with one value. This operator is used in the
construction of array descriptors. The top of stack is an
array descriptor for which at least one dimension is
unspecified. This instruction selects one value for the
first unspecified dimension. This reduces the dimension of
the array by one. The descriptor on top of the stack will be
modified to reflect this new type; if the logical type is
now a scalar this descriptor will (conceptually) be a
pointer to a scalar. The syntax is:

IX1 oldtype,newtype

where ''oldtype'' is the type of the descriptor on top of
the stack before indexing and ''newtype'' is the type after
indexing.

IX2 Index with two values. This operator is used in the
construction of array descriptors. It ''fills in'' the
index specification for the first unspecified index range.
The second element on the stack and the top of stack are
integers specifying the low and high bounds, respectively.
The third element on the stack is the array descriptor.
The syntax is:

IX2 type

LCA Load address of constant. The syntax is:

LCA type,constant

The constant itself is specified; the code generator is
responsible for setting up a static constant somewhere.
The constant must be a scalar.

LDC Load constant. The syntax is:

LDC type,constant

LDI Load indirect (load value pointed to by top of stack). The
syntax is:

LDI type

When an LDI is applied to a file, the file buffer is loaded
onto the stack.

LEQ Test for less-than or equal-to. The syntax is:

LEQ type

LES Test for less-than. The syntax is:

LES type

LLA Load address. The syntax is:

LLA lexlevel,localid

where lexlevel is the lexical level (the level of nesting)
and localid is the local variable index number as specified
by a .LOCAL or .ARG definition (see above).

LOD Load contents of address. The syntax is:

LOD type,lexlevel,localid

where type is the type of the variable, lexlevel is the
lexical level (level of nesting), and localid is the local
variable index as defined by a .LOCAL or .ARG definition
(see above).

MOD Perform modulus (remainder) operation. This is only valid
for items of type integer (or subrange) or arrays of type
integer (or subrange). The syntax is:

MOD type

MOV Move a specified number of storage units. The syntax has
not yet been defined.

MUL Multiply. The syntax is:

MUL type

MST Mark stack (used for procedure calls). The syntax is:

MST level

where level is the lexical level of the procedure or
function which will be called.

NEG Negate top of stack. The syntax is:

NEG type

NEQ Test for not-equal. The syntax is:

NEQ type

NOT Perform Boolean not (logical complement). This is only
valid for Boolean scalars and arrays. The syntax is:

NOT type

ODD Test for odd. This is only valid for integer (or subrange)
scalars and arrays. The syntax is:

ODD type

OTW This operator implements the ''otherwise'' conditional. It
is used to reverse the sense of a nested mask established
by the WHR instruction. The syntax is:

OTW type

where type is the type of the current mask. (At this
point, the expression on top of the expression stack
specifies the current mask.) OTW is illegal if no mask is
currently in effect. If a mask is currently in effect, the
new mask is computed by performing an exclusive-or between
the current mask and the previous mask (that is, the mask
that was in effect before the most recent WHR).

RET Return from block. The syntax is:

RET type

where type is either the literal string ''nil'' or is a
type name. In the former case, the called routine is a
procedure and no value is to be returned to the caller. In
the case of a function, type is the type of the data which
is returned by the function. (See below.)

SEL Select a record field. This operator causes a record
descriptor to be constructed from an address, array
descriptor, or record descriptor already on the stack. The
syntax is:

SEL oldtype,cmptype,newtype

where oldtype is the type of the current top-of-stack,
cmptype is the type of the item being selected (i.e. it
specifies the record type and the component name), and
newtype is the type of the result.

SUB Perform subtraction. The syntax is:

SUB type

SGS Generate singleton set. The syntax is:

SGS settype

The set is constructed from the element on top of the
stack.

SQA Square top of stack. The syntax is:

SQA type

STO Store indirect (at address specified by second element on
stack). The syntax is:

STO type

When an STO is applied to a file, the top of stack is stored
in the file buffer.

STP Stop execution. There are no arguments.

STR Store at compile-time known address. The syntax is:

STR type,lexlevel,localid

where type is the type of the value, lexlevel is the
lexical level (level of nesting), and localid is the local
variable index as defined by an .ARG or .LOCAL definition
(see above).

UJC Error in case statement (abort). The syntax is:

UJC label

UJP Unconditional jump. The syntax is:

UJP label

UNI Perform set union. The syntax is:

UNI type

WHR Define a new logical mask (''where''). The syntax is:

WHR type

where type must be an array of type Boolean. Masking is
performed in a nested manner. If there is no active mask,
the top of the expression stack defines the mask. If a
mask is active, the current mask is logically ANDed with
the array on the top of the expression stack to form the
new mask. (The previous mask is ''pushed''.) In this case,
the two expressions must have identical types. Masks
established in this fashion can be further manipulated with
the OTW and ENW operators.
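
The interplay of WHR, OTW, and ENW can be modelled with a small mask stack. The Python sketch below is illustrative only: the class MaskStack is hypothetical, masks are plain lists of Booleans, and treating ''no mask in effect'' as an all-true previous mask for the outermost OTW is an assumption, since the text does not define that case explicitly.

```python
class MaskStack:
    """Hypothetical model of nested masking for a one-dimensional array."""

    def __init__(self, size):
        self.size = size
        self.current = None     # None means no mask is in effect
        self.saved = []         # masks in effect before each WHR

    def whr(self, cond):
        if len(cond) != self.size:
            raise ValueError("mask and array shapes must match")
        self.saved.append(self.current)       # previous mask is pushed
        if self.current is None:
            self.current = list(cond)
        else:
            self.current = [m and c for m, c in zip(self.current, cond)]

    def otw(self):
        if self.current is None:
            raise RuntimeError("OTW with no mask in effect")
        previous = self.saved[-1]
        if previous is None:
            previous = [True] * self.size     # assumed: everything enabled
        # exclusive-or of the current mask with the previous mask
        self.current = [m != p for m, p in zip(self.current, previous)]

    def enw(self):
        if self.current is None:
            raise RuntimeError("ENW with no mask in effect")
        self.current = self.saved.pop()       # restore the previous mask


m = MaskStack(4)
m.whr([True, True, False, False])       # outer where
m.whr([True, False, True, False])       # nested where
inner = list(m.current)
m.otw()                                 # nested otherwise
inner_otherwise = list(m.current)
m.enw()                                 # end of nested where
restored = list(m.current)
```

Because the current mask is always a subset of the previous one, the exclusive-or in OTW selects exactly those elements enabled by the enclosing mask but rejected by the most recent WHR.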

XJP Indexed jump (jump to specified value + top of stack). The
item on the top of the stack must be an integer or
subrange. The syntax is:

XJP value


B.4 Descriptors and the Stack

As in P4, all operations are performed on a conceptual
stack. However, the stack may contain several different types of
items beyond those allowed by P4. The representation of the
various data types is described below:

scalars Scalars are manipulated directly on the stack, i.e.
when a scalar is loaded the value of the scalar is
placed on the stack. Thus, scalars can be directly
manipulated.

arrays Unlike scalars, arrays are not pushed on the stack.
Instead, the array is described by an array descriptor.
The array descriptor consists of the type of the array
(as defined by a .ARRAY statement) which is static, and
a dynamic portion which specifies the array address and
the indexing information. The descriptor is
constructed by performing an LLA (loading the ''lexical
address'' of a variable) or CVT (when a scalar is
converted to an array) followed by a series of indexing
instructions (IX0, IX1, IX2) to specify the index
ranges.

records Records are represented by a record descriptor whose
exact format is the domain of the code generator. The
layout of the various records is specified by .RECORD
pseudo-ops and the descriptors themselves are
constructed by use of the SEL instruction.

sets Sets are represented by a set descriptor which merely
gives the address of the set. The permissible values
in the set are known statically and are given by .SET
statements.

Unlike the use of scalars, performing a ''load'' of an
array, record, or set does not place the data on the stack.
Instead, the hypothetical stack machine (machine code generator)
replaces the descriptor on top of the stack with another which is
identical except for the address (the address of a temporary
array location is used). Similarly, when the conversion operator
(CVT or CVN) is applied to convert one item to another, the
conversion is performed into a temporary area and the appropriate
descriptor is placed on the stack.

It is illegal to perform any operation in which the size and
shape of the arguments do not match. It is possible, however,
that two arrays with different index ranges (but identical
shapes) will be combined by an operation. In this case, the
result will be placed into a temporary array according to the
indices specified by the second descriptor on the stack. Also,
an array consisting of a record component may be combined with
another array, provided that the base types and array shapes
match.


The function of the CVT and CVN operators when one of the
types is a scalar deserves some comment. First, either may be
used to convert a scalar to an array with the same type as the
scalar. For instance,

        .ARRAY arr,real,1,1,5      (array [1..5] of real)
        CVT real,arr

In this case, the scalar on top of the stack is replaced by an
array descriptor (which references a temporary area). This
descriptor is ''blank'' - no indexing information is specified.
A sequence of IX? instructions is then performed to ''fill in''
the indexing information. Second, either may be used to convert
a one-element subset of an array into a scalar. In this case,
the array descriptor on the stack is replaced by the appropriate
scalar value.
B.5 Function and Procedure Calls

Standard procedures and functions (hereafter called
subroutines) are all called in the same fashion. First the stack
is marked with MST, specifying that the called routine is at
lexical level zero. (The lowest lexical level available to user
routines is 1, and that is used for the main program.) Next, the
arguments are evaluated left-to-right and placed on the stack.
Finally, the routine is called with a CSP instruction. In all
standard function and procedure calls at most one argument has a
type which varies from call to call; this is referred to as the
''primary'' argument. In addition to specifying the called
routine, the CSP instruction specifies the type of the primary
argument and the result type of the function (''nil'' if the
called routine is a standard procedure).

Scalars are passed to subroutines on the stack; structured
types are passed via descriptors. For call by reference, the
array, set, or record descriptor is pushed on the stack. For
call by value an LDI is performed - this causes the array to be
copied into a temporary area and a descriptor for this temporary
array to be placed on the stack.

Control is returned to the caller when the RET instruction
is executed. If the called routine is a procedure, the runtime
stack is reset to the last marked location. If the called
routine is a function, the returned value is placed on the
runtime stack in the locations reserved for it (see above) and
the stack is reset to the end of this area.

User subroutines are called in the same fashion as the
standard ones except that the CUP instruction is used. This
instruction specifies the lexical level at which the user
subroutine will run and (for functions) the data type of the
function result. (This is ''nil'' for procedures.)

B.5.1 Elemental Functions

The Pascal standard functions all may operate on either
scalars (as in standard Pascal) or arrays. The result has the
same shape as the function argument (although sometimes the base
type is different). Because they operate independently upon the
elements of arrays these functions are referred to as ''elemental
functions''. The following functions form this set: abs, arctan,
chr, cos, eof, eoln, exp, ln, odd, ord, pred, round, sin, sqr,
sqrt, and trunc. For all of these functions, the stack is
marked, the argument is loaded, and the function is called:

        MST 0
        <load argument>
        CSP func,argtype,resulttype

An argument is always specified; if no argument was specified in
the Parallel Pascal program (e.g. using ''eof'' with no argument)
the default (''input'' in this case) is explicitly specified by
the compiler ''front-end''.
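
The three-step calling sequence lends itself to a small emitter. The Python sketch below is an illustration under stated assumptions: the function emit_standard_call is hypothetical, and the LOD operands used in the example (lexical level 1, local index 3) are invented for the purpose of the demonstration.

```python
def emit_standard_call(func, load_lines, argtype, resulttype="nil"):
    """Emit the P-code lines for a call of a standard subroutine.

    load_lines holds the instructions that place the argument(s) on the
    stack; resulttype stays ''nil'' for standard procedures.
    """
    code = ["MST 0"]                     # standard routines run at level zero
    code.extend(load_lines)              # arguments, evaluated left to right
    code.append("CSP %s,%s,%s" % (func, argtype, resulttype))
    return code


# sqrt applied to a real variable at an assumed lexical level and index.
call = emit_standard_call("sqrt", ["LOD real,1,3"], "real", "real")
```

The same helper covers standard procedures by leaving resulttype at its default value.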
B.5.2 Transformational Functions


In addition to the standard functions provided by Pascal,, 
Parallel Pascal contains some standard functions which perform 
transformations on entire arrays. This set of functions includes 


shift 


0 * « % 


rotate 


t t t . 


trans'', ''expand'', and the reduction 

min'', ' ' prod ' ' , and 
''sum''). In all of these cases the only argument to the 
function whose type varies is the array to be transformed. The 
stack is marked, the arguments are pushed, and the function is 
called. The old type of the array and the type of the function 
result are specified in the CSP instruction. 


functions (''any'', ''all 


' ' , ' ' max ' ' , ' ' 


The ''shift'' and ''rotate'' functions have the following
calling sequence:

     MST 0
     LLA <array address>
     IX? ...                ;specify index information
     LDC integer,<#>        ;one for each array dimension
     CSP func,arrtype,resulttype


The ''expand'' function has the following calling sequence:

     MST 0
     LLA <array address>
     IX? ...                ;specify index information
     LDC integer,<#>        ;new dimension
     LDC integer,<#>        ;low bound of new dimension
     LDC integer,<#>        ;high bound of new dimension
     CSP expand,arrtype,resulttype


The reduction functions have the following calling sequence:

     MST 0
     LLA <array address>
     IX? ...                ;specify index information
     LDC integer,(<set of dimensions>)
     CSP func,arrtype,resulttype

Note that the dimensions along which the reduction is to take
place are specified in one powerset constant.
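
As an illustration of reducing along a set of dimensions, the following Python sketch handles the two-dimensional case only. The function name ''reduce_2d'', the use of a Python set in place of the powerset constant, and the numbering of dimensions from 1 are assumptions of the sketch, not details taken from the P-code definition.

```python
from functools import reduce
import operator

def reduce_2d(op, matrix, dims):
    """Reduce a 2-D array (list of rows) along the dimensions in `dims`.

    `dims` is a Python set holding 1, 2, or both; it stands in for the
    powerset constant of the calling sequence. Dimension numbering
    starting at 1 is an assumption of this sketch.
    """
    if dims == {1, 2}:                      # reduce to a scalar
        return reduce(op, (x for row in matrix for x in row))
    if dims == {1}:                         # collapse rows: one value per column
        return [reduce(op, col) for col in zip(*matrix)]
    if dims == {2}:                         # collapse columns: one value per row
        return [reduce(op, row) for row in matrix]
    raise ValueError("dims must be a non-empty subset of {1, 2}")

a = [[1, 2, 3],
     [4, 5, 6]]
print(reduce_2d(operator.add, a, {1}))      # ''sum'' along dimension 1
print(reduce_2d(max, a, {2}))               # ''max'' along dimension 2
print(reduce_2d(operator.mul, a, {1, 2}))   # ''prod'' over both dimensions
```

Passing the whole set at once, rather than one dimension per call, matches the single constant pushed before the CSP instruction.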

B.5.3 Input and Output Procedures 

The standard procedures ''get'', ''put'', ''reset'', and
''rewrite'' each operate upon one argument. The calling sequence
is

     LLA <file>
     CSP func,filetype,nil

where ''filetype'' is the logical type of the argument.

The standard procedures ''read'' and ''write'' are actually
implemented by several specialized procedures. When the file
type is not ''text'' (that is, not file of char) the following
equivalent sequences are used:

     read(f,x)  =  x := f^; get(f)
     write(f,x) =  f^ := x; put(f)
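
The two equivalences can be traced with a toy model of a Pascal file and its buffer variable. The sketch below is illustrative only: the class ''PascalFile'' is hypothetical, and a Python list plays the role of the file.

```python
class PascalFile:
    """Toy model of a Pascal file with a buffer variable (written f^).

    Hypothetical class for illustration; a Python list plays the file.
    """
    def __init__(self, items=()):
        self.items = list(items)
        self.pos = 0
        self.buffer = self.items[0] if self.items else None   # f^

    def get(self):
        """Advance to the next component and refill the buffer."""
        self.pos += 1
        self.buffer = self.items[self.pos] if self.pos < len(self.items) else None

    def put(self):
        """Append the buffer variable to the file."""
        self.items.append(self.buffer)

def read(f):
    x = f.buffer      # x := f^
    f.get()           # get(f)
    return x

def write(f, x):
    f.buffer = x      # f^ := x
    f.put()           # put(f)

src = PascalFile([10, 20, 30])
dst = PascalFile()
write(dst, read(src))
write(dst, read(src))
print(dst.items)
```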

When the file is of type ''text'' the standard procedures
''rdi'', ''rdr'', and ''rdc'' are used for reading integers, real
numbers, and characters (respectively); the standard procedures
''wri'', ''wrr'', ''wrc'', and ''wrs'' are used for writing
integers, real numbers, characters, and strings (respectively).

The read functions have the calling sequence:

     LLA <file>
     CSP func,text,resulttype

where ''text'' is the type name for a file of char.


The write procedures require additional arguments. These
specify the field width and the scale factor (this is meaningful
only for floating-point numbers). The calling sequence for
''wri'' and ''wrc'' is:

     <compute expression to be output>
     LDC integer,<width>
     LLA <file>
     CSP func,exprtype,nil

The ''wrr'' function requires the scale factor:

     <compute expression to be output>
     LDC integer,<width>
     LDC integer,<scalefactor>
     LLA <file>
     CSP wrr,exprtype,nil

The ''wrs'' function requires one additional parameter - the
string length:

     LLA <string>
     LDC integer,<width>
     LDC integer,<length>
     LLA <file>
     CSP wrs,stringtype,nil
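
The three write sequences differ only in how the operand is pushed and in whether a second integer constant is supplied. A compiler front-end could emit them along the lines of the Python sketch below. This is not the actual compiler: the function ''emit_write'', the opaque placeholder operands, and the type names written in the CSP line are all assumptions made for the example.

```python
def emit_write(proc, file_ref, operand, width, extra=None):
    """Return the P-code lines for one text-file write call.

    `proc` is one of wri, wrc, wrr, wrs. `extra` is the scale factor
    for wrr or the string length for wrs. Operands are kept as opaque
    placeholder strings; the type names used in the CSP line are
    illustrative assumptions.
    """
    types = {"wri": "integer", "wrc": "char", "wrr": "real", "wrs": "string"}
    if proc not in types:
        raise ValueError("unknown write procedure: " + proc)
    if (proc in ("wrr", "wrs")) != (extra is not None):
        raise ValueError("extra is required for wrr/wrs only")

    code = []
    if proc == "wrs":
        code.append("LLA " + operand)            # address of the string
    else:
        code.append(operand)                     # code that computes the value
    code.append("LDC integer,%d" % width)        # field width
    if extra is not None:
        code.append("LDC integer,%d" % extra)    # scale factor or length
    code.append("LLA " + file_ref)
    code.append("CSP %s,%s,nil" % (proc, types[proc]))
    return code

for line in emit_write("wrr", "<file>", "<expr>", 12, 4):
    print(line)
```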


B.5.4 Miscellaneous Standard Procedures 

The procedures ''new'' and ''dispose'' are used for dynamic 
memory allocation. The calling sequence is: 

LLA <pointer> 

     CSP func,pointertype,nil

There is no concept of packed data in Parallel P-code;
hence, the Parallel Pascal procedures ''pack'' and ''unpack''
have no Parallel P-code counterparts.


B.6 Masking

Normally, all selected elements of arrays participate in all
operations. A subset of these elements can be selected by
specifying a mask. When a mask is in effect, all array
assignments must conform to the shape of the mask. Masking is
done with the use of a mask stack. If the stack is empty, no
masking is in effect. If the stack is non-empty, the top of the
mask stack specifies the current mask. The mask stack can be
implemented as a set of pointers to values on the runtime
(expression) stack. Expressions which are used to construct
masks remain on the expression stack until the mask is removed.
(They are never examined by the compiled code after they are
calculated; thus, the storage specified by the runtime stack may
be used to hold temporaries for the mask stack.)

A new mask is established with the WHR (''where'')
instruction. The top of the expression stack is logically ANDed
with the top of the mask stack, and the result is pushed on to
the mask stack. (If there was no previous mask, the expression
is simply pushed onto the mask stack; if there was a previous
mask its type and the type of the new expression must be

The OTW (''otherwise'') instruction provides the means for
reversing the sense of the last conditional. If there is only
one mask on the mask stack, it is complemented. Otherwise, the
new mask is computed as the exclusive-or of the current mask (top
of the mask stack) and the previous mask (next-to-top of the mask
stack).

The ENW (''end where'') instruction is used to ''pop'' the
mask stack. The mask stack and the runtime stack are popped. If
the mask stack is now empty, the effect of masking is removed.

Masking only affects the STO and STR instructions (i.e. only
assignments). The effect of a mask is not transmitted to any
called procedures or functions.
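
The interaction of the three masking instructions can be followed in a small Python model. It is a sketch under stated assumptions rather than the implementation: the class and method names are hypothetical, masks are plain lists of Booleans, and the runtime (expression) stack that would actually hold the mask values is not modelled.

```python
class MaskStack:
    """Toy model of the WHR / OTW / ENW instructions.

    Masks are lists of Booleans, one per array element. Class and
    method names are hypothetical; the runtime (expression) stack is
    not modelled.
    """
    def __init__(self):
        self.stack = []

    def whr(self, cond):
        # New mask = condition ANDed with the current mask (if any).
        if self.stack:
            cond = [c and m for c, m in zip(cond, self.stack[-1])]
        self.stack.append(list(cond))

    def otw(self):
        top = self.stack[-1]
        if len(self.stack) == 1:
            self.stack[-1] = [not t for t in top]                  # complement
        else:
            prev = self.stack[-2]
            self.stack[-1] = [t != p for t, p in zip(top, prev)]   # exclusive-or

    def enw(self):
        self.stack.pop()

    def store(self, target, values):
        # Masked assignment: only selected elements are overwritten.
        if not self.stack:
            target[:] = values
        else:
            for i, selected in enumerate(self.stack[-1]):
                if selected:
                    target[i] = values[i]

a = [3, -1, 4, -5]
m = MaskStack()
m.whr([x < 0 for x in a])          # where a < 0 do
m.store(a, [-x for x in a])        #   a := -a
m.otw()                            # otherwise
m.store(a, [x * 10 for x in a])    #   a := a * 10
m.enw()                            # end where
print(a)
```

In the nested case the exclusive-or yields exactly those elements that were selected by the enclosing mask but rejected by the inner condition, which is why no separate copy of the inner condition needs to be kept.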

APPENDIX C: HIGH LEVEL LANGUAGES FOR PARALLEL MATRIX PROCESSORS


A number of languages exist which could potentially be 
implemented on a parallel matrix processor. This appendix 
considers these languages. They can be grouped into conventional 
languages (that is, languages designed for conventional machines) 
and array languages (those designed with parallel processing in
mind).


C.1 Conventional Languages

There is a wide variety of languages in this class. Of 
these languages, three stand out as possible candidates: APL 
(because of its inherent array capabilities), FORTRAN (because of
its popularity among scientists and engineers), and PASCAL
(because of its growing popularity in the programming community).

C.1.1 APL

APL (''A Programming Language'') was originally developed by 
Iverson as a mathematical notation. It is characterized by a 
rich set of primitive functions, compact notation, and flexible 
data handling. The type, shape, and size of data is runtime 
dependent, and primitive functions as well as properly-written
user functions can operate upon data of greatly diverse size.

APL provides a direct means for specifying parallel
operations - entire arrays may be manipulated at once in a
variety of ways. Unfortunately, the flexibility of the size and
shape of expressions is often obtained at a high execution cost.
APL implementations usually involve some degree of interpreted
code (or periodic recompilation of code to adapt to new data
shapes).

APL provides no data structures in addition to the (very
flexible) array. The only control flow construct (aside from
function calls, which may be recursive) is the branch statement.
Some programmers dislike APL because it is by nature
unconventional. These factors, in combination with the concern
over its runtime efficiency, make it unsuitable for direct
implementation on a parallel matrix processor. Only a few
restrictions need to be made to the APL language for it to be
completely compilable. These restrictions and other
modifications to APL to make it suitable for parallel matrix
processors are described in reference 1.

C.1.2 FORTRAN

FORTRAN is, in a sense, the ''grandfather'' of high-level
languages. Designed in the early 1950's, it is the oldest
high-level language still in use. It was designed for numerical
computation (the name stands for ''FORmula TRANslation''), and
many highly-optimizing compilers for FORTRAN produce very fast
numerical code. Because of its age and dominance in the
scientific processing field it is familiar to most scientists and
engineers. Its widespread use has also resulted in the
development of a number of software libraries which assist in the
construction of large programs.

FORTRAN's age is a mixed blessing, however. The language
was developed before lexical analysis and parsing were fully
understood, and its syntax is flawed in a number of ways. It is
not conducive to structured programming (although the 1977
standard does provide a number of revisions toward this end). It
provides no data structuring facilities other than the array, and
is very awkward when dealing with character data or complex
program control flow. It does not provide any aggregate array
operations (although array facilities are under consideration for
the next FORTRAN standard)[2]. It therefore is unattractive as a
language for a parallel matrix processor.

C.1.3 Pascal

The programming language Pascal[3] was designed to achieve 
several goals, including[4] 

• To make available a notation in which the fundamental 
concepts and structures of programming are expressible in a 
systematic, precise, and appropriate way. 

• To make a notation available which takes into account the
various new insights concerning systematic methods of
program development.

• To demonstrate that a language with a rich set of flexible
data and program structuring facilities can be implemented
by an efficient and moderately sized compiler.

The resulting language has been the center of a great deal
of attention since its development. It is making inroads into
areas previously occupied by FORTRAN (for example, introductory
programming courses at many universities) and has become very
popular in the so-called ''personal computer'' market.

Pascal provides a flexible data structuring facility,
permitting programmers to collect data into aggregate structures
(records) and to define enumerated scalar types to provide
mnemonic access to flag variables, etc. To reduce the errors
which occur from incorrectly specifying the type of a data item,
strong type checking is enforced. Type compatibility is checked
at compile time whenever possible (thereby providing for fast
execution).

Pascal is not without its faults. There has been some
discussion concerning ambiguities in the typing mechanism and
insecurities in the use of records[5,6,7]. Two other problems
are particularly distressing: the lack of a separate compilation
facility and the lack of dynamic-length arrays.

Some implementations of Pascal do provide for separate
compilation (Wirth's PASCAL 6000, for example), but these often
are done in a way which eliminates the advantages of Pascal's
strong type checking. There have been a number of proposals
addressing this issue[8,9,10]. Perhaps the reason for this
omission is the philosophy expressed by Wirth[4] that with a
sufficiently fast compiler (and no linkage editor) it would be
acceptable to make all changes at the source level, and merge
sources together. Unfortunately, this philosophy does not work
well on most systems where recompilation is expensive, especially
when the change which forced the recompilation affects only a
small portion of the code.


The fixed-size array problem results from Pascal's strong
type checking and the fact that the array index ranges are
considered part of the array type. This is especially limiting
when an array is passed as a parameter to a procedure or
function, prohibiting the design of ''library functions'' such as
a general sort routine. A number of solutions have been
suggested[11,12,13] including a parameterization scheme proposed
by Wirth[14]. More recently, the ISO Pascal standard[15] has
introduced the concept of a ''conformant array schema'', a means
by which a parameter to a procedure or function may be an array
whose index range is determined when the procedure or function is
called.


Despite its limitations, Pascal is a powerful language which
can be efficiently implemented. Although the standard language
does not possess any facilities for expressing parallel
computation, it forms an attractive base upon which such
facilities could be built.




C.2 Array Languages

The languages discussed above were designed for efficient
implementation on a conventional (non-parallel) processor. In
order to efficiently execute programs on an SIMD-class parallel
processor, it is necessary that computations be performed in
parallel whenever possible. There are basically three ways to
achieve this goal - using a vectorizing compiler, a language
which directly specifies the implementation, or a language which
directly specifies the parallelism but not the low-level
implementation. In the following sections, each approach is
considered.

C.2.1 Vectorizing Compilers

The first approach is to use a conventional language and
write a compiler which can detect operations that can be
performed in parallel. (The portion of the compiler which
performs this task is often referred to as a ''vectorizer.'') Two
examples of this technique are ILLIAC IV Fortran and the
PARAPHRASE vectorizer.

C.2.1.1 ILLIAC IV Fortran 

On the ILLIAC IV, the Fortran compiler contains a phase
called the ''Paralyzer'' (for ''parallelism analyzer and
synthesizer'') which performs parallelism analysis and converts
the original Fortran code into IVTRAN, an extended Fortran
dialect[16,17]. (IVTRAN is discussed in more detail in section
C.2.2.4.) The Paralyzer analyzes nested DO loops and extracts the
inherent parallelism, subject to a number of restrictions. The
Paralyzer output is then further processed by the IVTRAN compiler
to produce the object program. Since the Paralyzer accepts
standard Fortran as input, the use of the Paralyzer with IVTRAN
permitted an ILLIAC IV user to run standard Fortran programs on
the ILLIAC IV with no changes.

C.2.1.2 PARAPHRASE

The PARAPHRASE vectorizer[18] is not, by itself, a compiler.
Rather, it was designed as a preprocessor for SIMD machines (the
specific focus in this case was on pipelined vector machines).
It performs a number of source-to-source optimizations on FORTRAN
programs which restructure those programs for parallel execution.
PARAPHRASE produces output in standard Fortran with only two
extensions - the specification of ''vector loops'' to mark loops
which can be executed in parallel, and a provision for masking
conditionals by a mode vector. As a result of this approach, the
output from PARAPHRASE is relatively portable. This output can
then be processed by a relatively unsophisticated compiler for
the target machine to produce the final object code.

C.2.1.3 Vectorizing Compilers: Conclusions

The advantage of the vectorizing compiler approach is that
the programmer need not learn anything new. Unfortunately,
language systems based upon vectorizing compilers suffer from
several problems. One major problem is the vectorizer itself.
In order to be able to extract a large degree of parallelism from
an algorithm, the vectorizer will be complex. Like the
vectorizers described above, most vectorizers operate upon nested
iterative structures (e.g. nested FOR loops) and there are many
special cases which can frustrate attempts to fully extract the
parallelism that is present.

A more serious problem with the vectorizer approach is that
it does not account for the different nature of the architecture
upon which the program will be run. In many cases, algorithms
which are optimal on a scalar (conventional) processor are not
suitable for implementation on a parallel processor. In order to
effectively program a parallel processor it is necessary to
''think parallel.'' A vectorizing compiler hides this fact from
the user; thus, a programmer who uses such a compiler may be
reluctant to change his programming practices, believing
erroneously that the compiler will do as well as he would. These
reasons discourage the use of a vectorizing compiler for a
parallel matrix processor language.

C.2.2 Direct Specification of Implementation 

The second approach to implementing a high-level language
system is to design a language which fully exposes the
architecture of the machine to the user. In a sense, the result
is a ''high-level assembly language.'' The following sections
describe (in alphabetical order) some languages in this category.

C.2.2.1 CFD

CFD is a Fortran dialect that was developed by the
Computational Fluid Dynamics branch of NASA Ames Research Center
for the ILLIAC IV[19]. It was designed for the applications area
of fluid flow analysis, for which programs had previously
been coded in standard FORTRAN.

CFD provides two forms of variables: CU (control unit)
variables, which hold scalars, and PE (processing element)
variables, which hold vectors. The first dimension of a PE
variable is always 64 elements long and is represented by an
asterisk. For instance, the following statements declare a
64-element vector and a square 64x64 matrix:

*PE INTEGER X(*) 

*PE INTEGER MAT(*,64) 

(The leading asterisk appears in all CFD statements except
assignment statements.) Scalar variables are used as in standard
FORTRAN. Array variables may be used as scalars (with one
element selected) or as vectors of length 64 (that is, with every
element along the first dimension selected). Hence, given the
above definitions, the following would store the first column of
''MAT'' in ''X'':

     X(*) = MAT(*,1)

Index arithmetic may be used to reposition data by circularly
shifting it through the array; for instance,

     X(*) = X(*+1)

rotates the vector ''X'' one position to the left. When the
first subscript of an array is an asterisk, the second subscript
(if any) may specify a vector expression. For instance:

     MAT(*,X(*)) ≡ (MAT(1,X(1)), MAT(2,X(2)), ..., MAT(64,X(64)))
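
Both forms of CFD subscripting can be mimicked on ordinary Python lists. The sketch below is illustrative only: the function names are hypothetical, and a 4-element vector is used in place of the 64 elements of the ILLIAC IV for readability.

```python
def rotate_left(x, k=1):
    """X(*) = X(*+k): element i receives the old element i+k, circularly."""
    n = len(x)
    return [x[(i + k) % n] for i in range(n)]

def gather(mat, x):
    """MAT(*,X(*)): element i is MAT(i, X(i)); subscripts in x are 1-based."""
    return [mat[i][x[i] - 1] for i in range(len(x))]

x = [10, 20, 30, 40]
print(rotate_left(x))

mat = [[11, 12, 13, 14],
       [21, 22, 23, 24],
       [31, 32, 33, 34],
       [41, 42, 43, 44]]
print(gather(mat, [1, 3, 2, 4]))
```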

Two provisions are made for selecting a subset of the
64-element vector. First, a parallel conditional statement may be
used; for example,

     *IF (A(*) .LT. 0.) A(*) = -A(*)

takes the absolute value of the vector A by storing only into
those elements which are less than zero. Second, the 64
processors in the ILLIAC IV can be explicitly turned on or off by
manipulating the logical vector ''MODE'':

     MODE = (A(*) .LT. 0.)

turns off all processors except those where the value of ''A'' is
less than zero.
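
The effect of the parallel conditional can be sketched in Python as a per-element enable pattern. The function name is hypothetical and the list ''mode'' merely imitates the role of the ''MODE'' vector; the sketch does not model the ILLIAC IV hardware.

```python
def parallel_if_negate(a):
    """*IF (A(*) .LT. 0.) A(*) = -A(*), element by element."""
    mode = [v < 0.0 for v in a]            # one enable bit per processing element
    return [-v if on else v for v, on in zip(a, mode)], mode

values, mode = parallel_if_negate([1.5, -2.0, 0.0, -0.5])
print(values)
print(mode)
```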

The special operators ''.ANY.'', ''.ALL.'', ''.NOT ANY.'',
and ''.NOT ALL.'' can be used to construct scalar logical
expressions from array logical expressions by performing the
indicated operation (e.g. ''.ANY.'' returns ''.TRUE.'' if any
element of its argument vector is true). The ''.SHL.'' and
''.SHR.'', and ''.RTL.'' and ''.RTR.'' operators perform left and
right shifts and rotates (respectively) on bit vectors.
Individual bits can be manipulated with the ''.TURN ON.'' and

CFD also contains provisions for transferring data between
the array memory and the main control unit. This is performed by
the TRANSFER statement. For instance, the following statement
transfers eight elements of the vector ''TEMP'' into the control
unit array ''I'':

     *TRANSFER (8) I=TEMP(1)


Although CFD permits the construction of extremely efficient
programs, its heavy reliance on the structure of the
underlying machine (in this case, the ILLIAC IV), particularly in
the number of processing elements and the vector nature of the
machine, makes it unattractive as the basis for a new language.

C.2.2.2 DAP Fortran 

DAP Fortran is a Fortran dialect for the Distributed Array 
Processor[20]. The DAP was designed to be connected to a host
computer as a memory module with internal processing
capabilities. DAP Fortran reflects this design.

A complete program consists of a main program and set of
subroutines written in standard Fortran for the host computer,
along with a set of subroutines written in DAP Fortran. The host
computer loads and starts the DAP; thereafter the two programs
can operate asynchronously. Communication is carried out through
a COMMON block. (The host processor can access the DAP as a
memory unit at all times, even when the DAP is processing data.)

DAP Fortran provides two basic data types: vectors and
matrices. Arrays of higher orders (that is, with more
dimensions) are represented as indexed sets of vectors or
matrices. The size of a vector (or the dimensions of a matrix)
must be the same as the hardware array size. The array
dimensions are not explicitly stated when declaring the array;
for example, the two-dimensional array ''A'' would be declared
with:

REAL A( * ) 

Two different data representations are used for vectors and
matrices; the representation is automatically changed when the
language semantics call for it.

DAP Fortran permits elements in a vector or set to be
indexed in several ways. First, a scalar index may be used as in
standard Fortran. Second, an index may be omitted; if this is
done the entire range of that index is selected. Third, the
notation

A(*I) 

may be used; this specifies that the element selected by ''I'' is
to be expanded to fill the entire vector. These methods can be
combined; for example, the expression:

A( **I) 

returns an array the size of ''A'', every column of which is the



DAP Fortran contains a number of useful facilities,
particularly array indexing facilities. However, the underlying
structure of the machine is evident (especially with respect to
array declarations). Also, DAP Fortran contains no input-output
facilities; instead, this is accomplished by the host processor.
Finally, DAP Fortran does not remedy many of the problems
associated with Fortran and its basic syntax. These factors
discourage the use of DAP Fortran as the basis for a general
parallel matrix processor language.


C.2.2.3 Glypnir

Glypnir was the first high-level language successfully
implemented on the ILLIAC IV[21]. It is based upon Algol 60,
with extensions to allow the programmer to explicitly specify the
parallelism of his algorithm.


Glypnir provides two major categories of variables: CU 
(control unit) variables which are single words, and PE 
(processing element) variables which are swords (64-word items). 
Vectors of words or swords may also be defined. There are no 
higher-order arrays. The statements: 

CU INTEGER Cl 
CU REAL VECTOR Z[100] 

PE REAL A 

PE REAL VECTOR V[100] 

declare the variable ''Cl'' to be a scalar integer, ''Z'' to be a
100-element array of real, ''A'' to be a sword of real (actually,
a 64-element vector of real), and ''V'' to be a 100-element
vector of swords (actually, a 100x64 matrix). In addition to
these types, the type ''BOOLEAN'' may be used to define 64-bit
Boolean variables. These are stored in the scalar memory, and
there is a correspondence between every processing element and
every bit in the Boolean word.

PE variables are never indexed along the ''parallel'' 
dimension. An index expression for the non-parallel dimension 
may be a scalar expression or it may involve PE variables. For 
instance, if ''I'' is an integer sword with values (I_0 = 0,
I_1 = 1, ..., I_63 = 63), then the expression ''Z[I+1]'' would
reference the following components of ''Z'': (1,0), (2,1),
(3,2), ..., (64,63). This is referred to as a slice.

Although Glypnir does not provide any means for indexing an
individual member of a sword, it does provide a means for
accessing the individual bits within each word. For instance,
the expression:

     A.[0:20] := A.[21:10] + 1

will cause the 10-bit field starting at bit 21 of A to be added
to 1 and stored in the first 20 bits of A. (If A is a sword,
this is done simultaneously for every word of the sword.) This
allows for dense packing of the (limited) available main memory.

Glypnir provides a ''pointer'' data type for dynamic memory
allocation. Blocks of words and blocks of swords may be
allocated and deallocated. A pointer variable may be either a
simple variable or a sword of pointers. There are two types of
pointers: those which can point anywhere in memory and those
which can only point to locations within a given memory module.

Glypnir extends the Algol 60 control-flow constructs for
parallel expressions. Conditionals may involve swords; if so,
then an enable pattern is set during the execution of each
''arm'' of the conditional to mask the execution in each
processing element. The iteration constructs are extended to
swords as well - the processor continues to loop until the
controlling expression is not satisfied in any array elements.
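
The looping rule for swords can be pictured with the following Python sketch. It assumes, as the description of conditionals suggests, that elements whose controlling expression is already false are simply disabled while the remaining elements continue. The function name and the use of a list for a sword are hypothetical.

```python
def parallel_while(sword, cond, body):
    """Loop until `cond` holds in no element; only enabled elements run `body`."""
    sword = list(sword)
    passes = 0
    while True:
        enable = [cond(v) for v in sword]      # enable pattern for this pass
        if not any(enable):
            break
        sword = [body(v) if on else v for v, on in zip(sword, enable)]
        passes += 1
    return sword, passes

# Halve each element until it is no larger than 1.
result, passes = parallel_while([8, 3, 1, 20], lambda v: v > 1, lambda v: v // 2)
print(result, passes)
```

The number of passes is set by the slowest element, so the whole array pays for the longest-running processing element.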

Glypnir provides for the declaration of subroutines;
however, recursion is not permitted and all arguments are passed
''by value.'' Subroutine arguments may be words, swords, or
slices, and subroutines may return either word or sword values.

The structure of Glypnir is very significantly influenced by
the underlying hardware (the ILLIAC IV). The lack of an indexing
mechanism along the parallel dimension makes it a highly
machine-dependent language. This fact, coupled with its vector
nature, makes it unsuitable as the basis of a new language for the
class of parallel matrix processors.

C.2.2.4 IVTRAN

IVTRAN is a Fortran compiler for the ILLIAC IV[22,17]. It
was designed for use with a vectorizing preprocessor (described
above), but it also contains some provisions for directly
specifying parallelism. The principal provision is the DO FOR
ALL statement:

     DO n FOR ALL (i_1, i_2, ..., i_n)/s

where i_1, i_2, ... are subscript variables and s specifies the
range over which they will vary. (n is a statement number which
defines the end of the loop.) Within the body of the DO FOR ALL
statement the control indices may only appear in assignments (or
conditional assignments) and all index expressions must be of the
form

          I
     or   I + C
     or   I - C

where ''I'' is one of the controlling indices and ''C'' is an
expression not depending upon any of the controlling indices.

IVTRAN also provides a syntax for specifying in detail the
memory allocation for an array. Array dimensions may be skewed
or aligned within ILLIAC processing elements, depending upon the
nature of the problem.

Because the alignment of arrays places limitations upon the
use of the Fortran EQUIVALENCE statement, two new declaration
statements are provided. OVERLAP is used to overlap array
allocations, thereby saving memory space, and DEFINE is used to
define new arrays (with different index ranges) that correspond
to previously-allocated arrays in a specified manner. Together,
OVERLAP and DEFINE provide most of the functionality of the
EQUIVALENCE statement.

IVTRAN provides mechanisms for specifying parallelism within
DO loops in a fairly machine-independent fashion. However, for
efficient program construction the programmer must deal very
closely with the ILLIAC IV architecture in the area of array
declarations, particularly concerning the alignment or skewing of
array dimensions across the (one-dimensional) array of processing
elements. This strong coupling to the underlying architecture
limits IVTRAN's suitability for implementation on a parallel
matrix processor.

C.2.2.5 Direct Implementation Specification: Conclusions

A programmer who is familiar with the machine architecture 
can write extremely efficient programs in a language which 
directly specifies the low-level implementation. Unfortunately, 
languages designed for a specific machine are usually very non- 
portable. In addition, it is somewhat undesirable that 
programmers be concerned with the specific details of the 
hardware implementation. 

A survey of users' experiences with the ILLIAC IV[23]
indicated that while users preferred to be able to directly
express the parallelism in their programs, the need to coerce
their algorithms to fit the underlying machine structure (as the
available languages, especially Glypnir and CFD, required them to
do) was considered a drawback. These reasons discouraged the use
of the direct specification of the low-level implementation in
the language selected for parallel matrix processors.

C.2.3 Direct Specification of Parallelism

The third approach to parallel language design is based upon
this idea. Languages in this category permit the direct
specification of parallel operations without requiring the
programmer to be intimately acquainted with the underlying
hardware.

In view of the criteria established above for a ''good''
programming language, a language which is designed according to
this third philosophy (that is, one which permits specification
of parallelism without forcing the programmer to specify the
exact hardware implementation) is highly desirable. A number of
languages in this category already exist. The following sections
discuss these languages (in alphabetical order) and their
suitability for implementation on a parallel matrix processor.

C.2.3.1 Actus

Actus[24] is a Pascal-based language suitable for scientific
programming on a vector processor. The original target machine 
for Actus was the ILLIAC IV, but the language was designed to be 
independent of the hardware upon which it is implemented. 

The design of Actus reflects the results of the survey of 
ILLIAC IV users mentioned above[23]. Perrott lists the following 
design criteria for Actus:

• The idiosyncrasies of the hardware should be hidden from the
user as much as possible.

• The user should be able to express the parallelism of the
problem directly.

• The user should be able to think in terms of a varying
rather than a fixed extent of parallel processing.

• Control of the parallel processing should be possible both
explicitly and through the data, as applicable.

• The user should be able to indicate the minimum working set
size of the database (in those cases where the database is
larger than the size of the fast memory).

Actus supports most of the standard Pascal types (the most
significant omission is the lack of variant records) along with
some additional types (short integer, short real) that, when
supported by the underlying hardware, provide more efficient
memory utilization. Parallelism is achieved through the use of
arrays - in an array declaration, one dimension may be declared
to be parallel by replacing the standard Pascal subrange symbol
with a colon. For example,

     var xxx: array [1:m, 1..n] of real

declares ''xxx'' to be an m x n array, where the first dimension may
be accessed in parallel. It is important to note that the
programmer is free to choose any size for the parallel dimension
- i.e. it is not constrained by the underlying hardware.

Actus provides for the definition of index sets and parallel
constants for indexing and initializing arrays. The syntax for
both is similar:

const parconst = initial : (increment) final
index indexset = initial : (increment) final

The expression to the right of the ''='' generates the following
(ordered) set of values:

initial, initial+increment, initial+2*increment, ..., final
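As a minimal sketch of the value sequence just described, the following Python function builds the ordered list of members. The function name and argument order are illustrative assumptions, and the sketch handles only positive increments; it is not Actus syntax.

```python
def index_set(initial, final, increment=1):
    """Return the ordered values initial, initial+increment, ..., final.

    Hypothetical helper: models only the value sequence of an index
    set or parallel constant, and only for positive increments.
    """
    if increment <= 0:
        raise ValueError("this sketch handles positive increments only")
    return list(range(initial, final + 1, increment))


# Corresponds to the range 1:(2)99, i.e. the odd numbers 1, 3, ..., 99.
odds = index_set(1, 99, 2)
```

If ''final'' is not reachable from ''initial'' in whole steps, this sketch simply stops at the last value that does not exceed it; what Actus does in that case is not stated in the text.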


While index sets are used for the parallel dimension of an
array, Actus allows the use of a vector (one-dimensional array)
as another index. For example, given the declarations

var diag: array [1:100] of integer;
    para: array [1:100, 1..100] of integer;

the statements

diag := 1:100;
para[1:100, diag[1:100]] := 0

are effectively the same as

for i := 1 to 100 do
   diag[i] := i;
for j := 1 to 100 do
   para[j, diag[j]] := 0;

(where ''i'' and ''j'' are arbitrarily-chosen integer variables).


The operators shift and rotate are provided to align data in
a parallel expression. shift performs an end-off shift, while
rotate performs a circular rotation. As an example, the
following:

index first50 = 1:50;
var para: array [1:100] of integer;

para[first50] := para[first50] + para[first50 shift 50];

is equivalent to

for i := 1 to 50 do
   para[i] := para[i] + para[i+50];
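The difference between an end-off shift and a circular rotation can be sketched on ordinary Python lists. The function names, the leftward direction, and the fill value used for vacated positions are assumptions made for illustration; they are not taken from the Actus definition.

```python
def shift(values, k, fill=0):
    """End-off shift left by k: elements pushed past the end are lost
    and the vacated positions receive `fill` (an assumed value)."""
    k = min(max(k, 0), len(values))
    return values[k:] + [fill] * k


def rotate(values, k):
    """Circular rotation left by k: elements pushed off one end
    reappear at the other end."""
    if not values:
        return []
    k %= len(values)
    return values[k:] + values[:k]


# The folding example: para[i] := para[i] + para[i+50] for i = 1..50.
para = list(range(1, 101))            # para[1..100] holds 1..100
shifted = shift(para, 50)             # element i now holds old para[i+50]
folded = [x + y for x, y in zip(para[:50], shifted[:50])]
```

Only the first 50 elements of the shifted vector are used, so the fill value never enters the result in this example.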

When parallel variables or constants are used in an Actus
statement, the extent of parallelism must be the same for all of
the participants. The extent of parallelism encompasses both the
size of the various items and the way that they are accessed;
this excludes statements such as

a[1:10] := a[2:11]

Such a statement must be written

a[1:10] := a[1:10 shift 1]

so that the extent of parallelism is clear.

The smallest program unit over which the extent of
parallelism cannot change is the assignment statement; however,
some of the control flow statements also define an extent of
parallelism. Once an extent of parallelism has been defined by
such a statement, it is signified in the controlled statements by
a sharp character (''#'').


Control statements which specify an extent of parallelism
include parallel versions of the Pascal while, if, and for
statements, the new while any, while all, if any, and if all
statements, and the new within statement. The within statement
merely defines the extent of parallelism; it has no other effect
upon the program flow.

Finally, Actus addresses a common problem among parallel
processors - lack of sufficient high-speed (''core'') memory,
requiring some form of automatic buffering or virtual memory. It
provides a syntax for specifying the minimum working set size for
an array, so that automatic memory management won't ''swap out''
crucial data.

Actus is a very attractive language for vector processors.
It satisfies most of the criteria stated at the beginning of this
chapter for a ''good'' programming language. It is based upon a
well-understood language (Pascal) and therefore is relatively
easy for programmers to learn; it can be efficiently compiled; it
does not force programmers to think at the low level of a
particular machine architecture; and it encourages the
development of well-structured programs. Unfortunately, Actus is
tied very strongly to a vector architecture, making it unsuitable
for matrix processors (e.g., Actus allows only one dimension of an
array to be accessed in parallel). This restriction was
addressed by Perrott in the language Actus Plus.

C.2.3.2 Actus Plus

Actus Plus[25] is a revision of the language Actus,
eliminating the one-dimensional restrictions of the original
language. Arrays may be declared with any number of parallel
dimensions.

Actus Plus allows considerably greater flexibility in the
use of index sets than Actus does. Index sets may consist of a
consecutive or skipped range (as in Actus):

index indexset = 1:(2)99;

a broken range:

index indexset = 1:10, 91:100;

an arbitrary range:

index indexset = 1, 3, 6, 9;

or a repeated range:

index indexset = 1:10, 2:5, 1:10;

Index sets may be combined with the operators ''+'' (union),
''*'' (intersection), and ''-'' (set difference). The rotate
operator may also be used (as in Actus) to rotate the members of
an index set; e.g., the following are equivalent:

1, 2, 3, 4 rotate 1 = 2, 3, 4, 1
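The three set-combining operations can be sketched in Python by treating an index set as an ordered list of integers. The function names are hypothetical, and the ordering rule used here (members keep the order of the left operand, with new members of the right operand appended) is an assumption; the text does not say how Actus Plus orders the result.

```python
def union(a, b):
    """Members of a, followed by members of b not already present."""
    return a + [x for x in b if x not in a]


def intersection(a, b):
    """Members of a that also appear in b, in the order of a."""
    return [x for x in a if x in b]


def difference(a, b):
    """Members of a that do not appear in b, in the order of a."""
    return [x for x in a if x not in b]


# A broken range such as 1:10, 91:100 is the concatenation of two ranges.
broken = list(range(1, 11)) + list(range(91, 101))
inner = difference(list(range(1, 101)), broken)   # the values 11..90
```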

Index sets play a crucial role in the specification of
parallel expressions. Actus Plus permits an expression to
combine any two items, provided that their extents of parallelism
are the same. Thus, the following

var
   mat1: array [1:n, 1:m] of real;
   mat2: array [1:m, 1:n] of real;

mat1[1:n, 1:m] := mat1[1:n, 1:m] * mat2[1:m, 1:n];

is equivalent to the Pascal code (where ''i'' and ''j'' are
arbitrarily-chosen integer variables):

for i := 1 to n do
   for j := 1 to m do
      mat1[i,j] := mat1[i,j] * mat2[i,j];

Although the meaning of the above is clear, the following similar
case is ambiguous:

var
   row: array [1:n] of real;
   mat: array [1:n, 1:n] of real;

mat[1:n, 1:n] := mat[1:n, 1:n] * row[1:n];

because it is not clear whether the multiplication should be
performed along the rows or the columns of ''mat''. The
ambiguity is resolved by using an index set:

index iset = 1:n;

mat[1:n, iset] := mat[1:n, iset] * row[iset];   (* row mult *)

mat[iset, 1:n] := mat[iset, 1:n] * row[iset];   (* column mult *)
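The two readings of the ambiguous statement differ only in which subscript of ''mat'' shares its index with ''row''. A minimal Python sketch follows; the function names and the nested-list representation are assumptions made for illustration.

```python
def row_mult(mat, row):
    """Index set shared with the second subscript: mat[i][j] * row[j].
    Every row of mat is multiplied element by element by `row`."""
    return [[mat[i][j] * row[j] for j in range(len(mat[i]))]
            for i in range(len(mat))]


def column_mult(mat, row):
    """Index set shared with the first subscript: mat[i][j] * row[i].
    Every column of mat is multiplied element by element by `row`."""
    return [[mat[i][j] * row[i] for j in range(len(mat[i]))]
            for i in range(len(mat))]


mat = [[1, 2],
       [3, 4]]
row = [10, 100]

by_rows = row_mult(mat, row)        # [[10, 200], [30, 400]]
by_columns = column_mult(mat, row)  # [[10, 20], [300, 400]]
```

The two results differ for any non-symmetric case, which is why the unqualified form is rejected as ambiguous.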

The while, if, and case statements in Actus are also
available in Actus Plus. Since these control constructs affect
the extent of parallelism, a sharp-sign notation (similar to that
in Actus) is used to represent the actual extent:

if a[1:n, 1:m] <> 0 then
   a[#1, #2] := a[#1, #2] + 1;

Actus Plus is an attractive language. It provides for the
direct specification of parallelism without forcing the
programmer to know the detailed architecture of the machine on
which his programs will run. The generalized index sets and the
flexible operators provided for manipulating them could be
somewhat expensive to implement on a machine with a limited
interconnection network. Nonetheless, Actus Plus would be a
strong candidate for implementation on a parallel matrix
processor. It did not influence the design of Parallel Pascal
because it was not specified in time; in addition, no research
results from an implementation of Actus Plus were available.

C.2.3.3 Proposed Extensions to ALA

The language ALA was proposed by Zosel as an extension to
ALGOL for the STAR-100[26]. The language reflects the
philosophies of APL and ALGOL-68.

Vector extensions to ALGOL are implemented in a natural way:
a vector may be used wherever a scalar may be used, provided that
there is an obvious interpretation of its meaning. Operands in
an arithmetic expression must be conformable: either they must be
the same size or one must be a scalar. The set of primitive data
types includes all of the data types defined by the hardware,
including 32-, 64- and 128-bit floating point representations (on
the STAR-100), the various integer representations, and bit

Control statements (e.g., conditional statements and loops)
must have scalar control variables; however, the functions
''allof'' and ''anyof'' are provided to reduce array expressions
to simple Boolean values.

ALA provides user-accessible descriptors for manipulating
vectors. A descriptor ''DESC'' may be associated with a vector
''VECT'' by one of the following two statements:

DESC -> VECT

DESC -> VECT[slice]

In the first form, the descriptor refers to the entire ''VECT''
array; in the second, it refers only to a subset of the elements
in ''VECT'' (the subset is determined by the ''slice''; the
format of the ''slice'' is defined below). During the course of
execution, the size of the vector referred to by the descriptor
may change; however, this change will not be reflected in the
original array. Descriptors are also used when an entire array
is passed as a parameter to a subroutine. Rather than passing
the array, the called routine receives a descriptor for the
array.

ALA permits indexing by a scalar, a Boolean set, a set, a
sparse set (a special STAR-100 capability), or a ''slice.'' A
slice may have one of two forms:

I:J
I;J

In the first case, items ''I'' through ''J'' are selected. In
the second case, all items except the first ''I'' and the last
''J'' are selected. All forms of indexing are valid on both
sides of an assignment statement, and indexing may be performed
on expressions as well as simple variables.
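The two slice forms can be sketched with Python list slicing. The function names are hypothetical, and the sketch assumes that items are numbered from 1, which the text does not state explicitly.

```python
def slice_range(vect, i, j):
    """I:J form: select items i through j inclusive (1-based, assumed)."""
    return vect[i - 1:j]


def slice_trim(vect, i, j):
    """I;J form: select all items except the first i and the last j."""
    return vect[i:len(vect) - j]


vect = [10, 20, 30, 40, 50, 60]
middle_by_position = slice_range(vect, 2, 4)   # items 2 through 4
middle_by_trimming = slice_trim(vect, 1, 2)    # drop first 1 and last 2
```

In this example both forms select the same three items, which shows that the second form is convenient when the length of the vector is not known in advance.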


ALA has some desirable features; in particular, the ability
to deal with vectors in the same fashion as (and in combination
with) scalars is very appealing. However, ALA is very heavily
weighted toward implementation on the STAR-100, and it includes
features which may be expensive to provide on other machines.
These include Boolean indexing, the use of sparse vectors, and
the large runtime variability of the size of vectors. Also, a
language designed for a vector processor is dissimilar in many
ways from a language for a matrix processor.


C.2.3.4 APLISP 

APLISP[27] is a language for image and speech processing.
Its target machine is the partitionable SIMD/MIMD system PASM[28],
but the language is machine independent.

The syntax of APLISP is similar in many ways to that of
Pascal. Deviations from Pascal include the definition of two new
fundamental data types (BYTE and INDEX), a flexible array
indexing scheme, and conditional control statements.


Arrays in APLISP are viewed as a set of named objects, each
of which is an ordered n-tuple consisting of the index (or
indices) and the value. (As an example, for a one-dimensional
array, the objects are ordered pairs (i,x) where i is the index
and x is the corresponding value.) Index sets are used to select
subsets of these n-tuples. For multi-dimensional arrays, sets of
index n-tuples may be specified by a Cartesian product or
concatenation of two index sets.

Index sets may be used in assignment statements on both
sides of the expression. Index sets which appear on both sides
of an assignment are forced to correspond to one another; hence,
the assignment A[U] := B[U] implies that for each u in U,
A[u] := B[u].


APLISP provides the WHERE statement for parallel conditional
evaluation. Execution is controlled by a conditional expression
over an index set. Within the body of the WHERE clause and
optional ELSEWHERE clause the range of the index set is
restricted to only those n-tuples for which the conditional is
true or false, respectively.
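The restriction of an index set by a WHERE clause and its optional ELSEWHERE clause can be sketched as follows. The helper name and its callback-style interface are assumptions for illustration only; APLISP's actual syntax is not reproduced here. The sketch evaluates the condition for every index before either body runs, which is one reasonable reading of parallel conditional evaluation.

```python
def where(index_set, condition, then_body, else_body=None):
    """Run then_body on indices where condition holds and else_body
    (if given) on the remaining indices."""
    # Evaluate the condition everywhere first, so that assignments made
    # by a body cannot change which indices are selected.
    mask = [(u, bool(condition(u))) for u in index_set]
    for u, selected in mask:
        if selected:
            then_body(u)
    if else_body is not None:
        for u, selected in mask:
            if not selected:
                else_body(u)


a = [5, -3, 0, 7]


def set_one(u):
    a[u] = 1


def set_zero(u):
    a[u] = 0


# WHERE a[u] > 0: a[u] := 1   ELSEWHERE: a[u] := 0
where(range(len(a)), lambda u: a[u] > 0, set_one, set_zero)
```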

APLISP provides a flexible mechanism for expressing
parallelism without consideration of the underlying machine. The
concept of index sets is a very powerful one (although as with
APL, the uninitiated may object to the concise and
highly-symbolic format). The runtime-dynamic shape and
configuration of the index sets may pose an implementation
problem on processors with restricted interconnect networks
(e.g., matrix processors with simple near-neighbor connections).
Nonetheless, APLISP has many attractive features. It did not
influence the design of Parallel Pascal because it was not
specified in time; in addition, no research results from an
implementation are available.


C.2.3.5 Fortran 8X 

At the present time, the X3J3 committee of the American
National Standards Institute is considering proposals for
extensions to Fortran[2,29]. All of the proposals are
still subject to change, so it is impossible at this time to
determine the form of the new language (often referred to as
''Fortran 8X''). However, it is instructive to consider some of
the proposed extensions in the realm of array indexing and
parallel processing.

The current proposals permit the use of unsubscripted array
names in arithmetic expressions on either side of the assignment
symbol (''=''). The evaluation and assignment is considered to
be simultaneous for all array elements. (This definition
facilitates the implementation on a parallel processor, where the
evaluation and assignment are simultaneous, as well as on a
conventional serial machine.)

When subscripts are specified, the special symbol ''*'' is
used to represent the entire range of the array. For example:

A(1,*) - select row 1

A(*,1) - select column 1

A(-*,1) - select column 1 in reverse order

Finally, an array section may be specified by a doublet or
triplet:

V(1:5) - select V(1), V(2), V(3), V(4), V(5)

V(1:K) - select V(1), V(2), ... , V(K)

V(1:5:2) - select V(1), V(3), V(5) [step by 2]
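For readers more familiar with slice notation, the selections listed above map onto Python list operations roughly as follows. This is only an analogy under stated assumptions: the helper name is hypothetical, arrays are modelled as nested lists, bounds are taken as 1-based and inclusive, and only positive steps are handled.

```python
def section(v, first, last, step=1):
    """Doublet or triplet section V(first:last[:step]) with 1-based,
    inclusive bounds and a positive step (assumed for this sketch)."""
    return v[first - 1:last:step]


V = [10, 20, 30, 40, 50, 60]
first_five = section(V, 1, 5)        # V(1:5)
every_other = section(V, 1, 5, 2)    # V(1:5:2)

A = [[1, 2, 3],
     [4, 5, 6]]
row_1 = A[0]                             # A(1,*)
column_1 = [r[0] for r in A]             # A(*,1)
column_1_reversed = column_1[::-1]       # A(-*,1)
```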

Other capabilities which are under consideration are more 
complicated array sections, vector indexing for arrays, an 
IDENTIFY statement (to restrict the number of array elements 
which are active when an explicit index expression is not given), 
and a conditional assignment (WHERE) statement. 

The proposals for Fortran 8X are of interest, because 
Fortran is one of the most widely-used high level languages in 
the field of scientific computing. However, it would be unwise 
to adopt the current proposals for Fortran 8X as the basis for a 
new language at this time, since it is likely that there will
still be significant revisions to the language before a standard 
is adopted. Until some of the other problems with Fortran can be 
satisfactorily resolved (for example, its lack of facilities for 
structured programming), a language based upon Fortran and the 
proposals outlined above is not suitable for implementation on a 
parallel matrix processor. 

C.2.3.6 Parallel Extensions to LRLTRAN (Fortran) 

LRLTRAN is an extended Fortran in use at the Lawrence
Livermore Laboratories in California. To accommodate the
STAR-100, a number of vector extensions were made to the
language[30]. The resulting language provides for the
specification of efficient manipulation of vector quantities.

The one-dimensional nature of LRLTRAN is very apparent in
the declaration of parallel variables. These variables are
explicitly declared as vectors, e.g.,

VECTOR A(99)

declares that ''A'' is a vector with 100 elements (the lower
bound of a vector index is always zero). Vectors so defined may
be used in arithmetic expressions (with the expected results),
e.g.,

A = A + 1

increments each element of the vector ''A'' by 1.

In addition to declaring vector storage, one may also define
vector descriptors:

VECTOR (BPTR,B)

In this case, the variable ''BPTR'' is a user-accessible
description of the address and size of a vector. After setting
''BPTR'' appropriately, the desired data may be accessed as a
vector via the descriptor ''B''. For example, if ''BPTR'' points
to a 7-word area beginning at address 1000, then the statement

B = 1

will set words 1000..1006 to 1. Operators are provided to
convert scalars to vector descriptors and vice versa and to
determine the length of a vector.

Vectors (or vector descriptors) which participate in
assignment statements need not be the same size. Scalars are
automatically extended to vectors during an assignment. If both
the left- and right-hand sides of the assignment are vectors, the
right-hand side is completely evaluated and the results are
assigned one-by-one until one of the two vectors is exhausted.
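The assignment rule for operands of unequal size can be sketched in Python. The function name and the list representation are assumptions; the sketch models only the behaviour described in the paragraph above (scalar extension, full evaluation of the right-hand side, and element-by-element assignment until either side is exhausted).

```python
def vector_assign(target, source):
    """Assign `source` to `target` in place and return the number of
    elements actually stored."""
    if not isinstance(source, list):
        # A scalar is extended to the length of the target vector.
        source = [source] * len(target)
    values = list(source)                  # right-hand side fully evaluated
    count = min(len(target), len(values))  # stop when either is exhausted
    target[:count] = values[:count]
    return count


long_target = [0, 0, 0, 0, 0]
stored = vector_assign(long_target, [1, 2, 3])   # source runs out first

short_target = [0, 0]
vector_assign(short_target, [7, 8, 9])           # target runs out first

filled = [0, 0, 0]
vector_assign(filled, 1)                         # scalar extension, as in B = 1
```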

Vectors may be subscripted with a scalar, a vector, a
contiguous range of elements, or a set (bit-vector); they may
also be treated as sparse vectors.

Finally, LRLTRAN contains a number of intrinsic functions to
provide for summing along vectors, merging vectors, etc. The
implementation of these is somewhat unfortunate: unlike normal
Fortran intrinsic functions, the user cannot override the
standard definitions with his own. Even if he supplies a
function definition, the compiler will use the predefined
intrinsic function.

LRLTRAN is a flexible language for dealing with the
STAR-100, but because of its strictly-vector nature it is not
well suited for a matrix processor. Many of its facilities (such
as sparse vectors and vector indexing) are directly related to
the hardware capabilities of the STAR-100 and may be very
expensive on a processor with a more rigid structure.

C.2.3.7 PascalPL

PascalPL[31] is a Pascal-based language which facilitates
parallel image processing. Its design was influenced by the
architectures of contemporary parallel arrays. It is presently
available at the University of Wisconsin, Madison, as a
translator which converts PascalPL programs to standard Pascal
programs.

A PascalPL program consists of a standard Pascal program
which contains parallel procedures. The parallel procedures
themselves may contain a mix of standard Pascal statements and
parallel constructs. All parallel constructs are distinguished
by the presence of two leading vertical lines (''||'', the
standard symbol for parallelism).

A parallel procedure is introduced with the declaration

||procedure procedurename;

At some point after this (between which there may be standard
Pascal statements), a ''dimension declaration'' must be placed to
define the bounds over which operations take place:

||dim [0..127, 0..127];

Optional fields also declare the data type (either integer -
which is the default - or Boolean), the association of arrays
with set names (sets of arrays), and the index mapping from the
input array set to the result array set. Once the dimension has
been defined, the following parallel constructs may be intermixed
with standard Pascal statements:

||read (...);
||write (...);
||set arrays_assigned_to := compound_of_arrays;
||if compound_of_arrays inequality
||then arrays_modified
||else arrays_modified_on_failure;   [optional]
||border := bordertype

The ||read and ||write constructs perform input and output
of arrays or subsets of arrays. The ||border statement defines
the value that is to be used when array indexing lies outside the
declared array dimensions.

The most significant feature of PascalPL is its array
indexing mechanisms. Array operations are performed by the ||set
and ||if constructs. The ||set instruction unconditionally
performs array assignments. The left-hand-side of the assignment
specifies one or more result arrays; the right-hand-side
specifies an array expression. The array expression may contain
scalar variables (always preceded in this context by a ''#''
character), constants, and array index expressions. A simple
example is:

||set array1 := 2*array2 + array3 - #mean;

which performs (element-by-element) addition of 2 times
''array2'' with ''array3'', subtracts from each element the value
of the scalar variable ''mean'', and stores the result (element
by element) in ''array1''.


PascalPL provides even greater flexibility in array
expressions by permitting the specification of neighborhood
operations. A specified set of near-neighbors may be
individually weighted and combined by a specified operation. For
instance, suppose that it is desired to compute

b[i,j] := a[i-1,j-1] + a[i-1,j+1] + a[i+1,j-1] + a[i+1,j+1] + 4*a[i,j]

for all elements of ''a'' and ''b''. This can be accomplished
with the following PascalPL statement:

||set b := a[+(-1:-1, -1:1, 1:-1, 1:1, 0:0*4)];

This mechanism is generalized even further to permit
thresholding; the following performs the same sum, but only
includes the value of the center if it is greater than 32:

||set b := a[+(-1:-1, -1:1, 1:-1, 1:1, 0:0*4>32)];
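The unthresholded neighborhood sum can be sketched in Python as a weighted sum over (row offset, column offset, weight) triples. The function name, the triple representation, and the choice of 0 as the default out-of-range value are assumptions for illustration; in PascalPL the out-of-range value would come from the border statement. Thresholding is left out because the text does not say whether the comparison applies to the raw or the weighted center value.

```python
def neighborhood_sum(a, offsets, border=0):
    """For every element of the 2-D list `a`, sum weight * neighbor over
    the given (di, dj, weight) triples; neighbors that fall outside the
    array contribute `border`."""
    rows, cols = len(a), len(a[0])

    def at(i, j):
        inside = 0 <= i < rows and 0 <= j < cols
        return a[i][j] if inside else border

    return [[sum(w * at(i + di, j + dj) for di, dj, w in offsets)
             for j in range(cols)]
            for i in range(rows)]


# Four diagonal neighbors with weight 1 and the center with weight 4.
DIAGONALS_PLUS_CENTER = [(-1, -1, 1), (-1, 1, 1), (1, -1, 1), (1, 1, 1),
                         (0, 0, 4)]

a = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
b = neighborhood_sum(a, DIAGONALS_PLUS_CENTER)
```

For the center element the result is 1 + 3 + 7 + 9 + 4*5 = 40; for the top-left corner three of the four diagonal neighbors lie outside the array, so only a[1][1] and the weighted center contribute.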

The conditional structure (||if) can be used to perform
simple modifications to arrays on an element-by-element basis, as
determined by the controlling conditional. For instance, to
triple the value of all elements in the array ''b'' if the
corresponding elements in array ''a'' are non-zero, the statement
would be:

||if a <> 0
||then b*3;

PascalPL is a very intriguing language. It has a compact
and powerful notation which is capable of representing many
desirable operations which can be efficiently performed by a
parallel matrix processor. The language appears to be easily
extensible to arrays with any number of dimensions. The ability
to mix PascalPL and standard Pascal within the same program is
also a definite advantage. Its biggest drawback may be its
biggest feature - the symbolism used to express parallel
operations. The operations that are specified for arrays are
syntactically and semantically different than similar operations
specified in standard Pascal. The conditional statement also
operates upon arrays with a different syntax; multiplying an
array ''xyz'' by 3 is done by

xyz*3

in an ||if statement, but by

||set xyz := xyz*3;

in an assignment statement. Thus, while PascalPL is a viable
candidate for a parallel matrix processor such as the MPP, it
seems desirable that a different approach, one which does not
significantly distinguish between parallel and scalar operations,
be taken.


C.2.3.8 VECTRAN

VECTRAN was proposed by researchers at IBM as an extension
to IBM FORTRAN IV[32]. It was designed as an upward compatible
extension to FORTRAN (66) for scientific applications
programming.

The declaration statements RANGE and IDENTIFY are used to
specify the array elements which participate in an operation.
The RANGE statement can be used to restrict the range of a
parallel operation to some subset of the array; for example:

RANGE /N,M/ A(10,10), B(15,25)
...
N = 5
M = N+2
...
A = 2.5*A + B

Only the 5x7 subarrays of A and B participate in the parallel
computation.
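The effect of the RANGE declaration on the example statement can be sketched in Python. The function name is hypothetical, and the sketch assumes that the participating subarrays are the leading N-by-M corners of A and B, which is the natural reading but is not spelled out in the text.

```python
def ranged_update(a, b, n, m):
    """Apply A = 2.5*A + B to the leading n-by-m subarrays only;
    all other elements of `a` are left untouched."""
    for i in range(n):
        for j in range(m):
            a[i][j] = 2.5 * a[i][j] + b[i][j]


A = [[1.0] * 10 for _ in range(10)]   # A(10,10), all ones
B = [[2.0] * 25 for _ in range(15)]   # B(15,25), all twos

N = 5
M = N + 2
ranged_update(A, B, N, M)             # only the 5x7 corner changes
```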

The IDENTIFY statement permits the redefinition of axes in
the array, so that well-defined substructures of an array may be
defined and used. For instance, the elements along the diagonal
of a two-dimensional array may be ''identified'' with the
elements of a one-dimensional array of the appropriate size.

Vector indexing is permitted in VECTRAN. The semantics are
similar to APL. The order of the values in the vector is
significant.

Parallel conditional control is provided by the WHEN and AT
statements. These statements differ in the order in which
evaluation is performed. WHEN fully evaluates the conditional
expression and each controlled expression, and then performs
conditional assignment. AT evaluates the conditional expression
and then conditionally evaluates the appropriate controlled
expression. Hence, WHEN performs conditional assignment while AT
performs conditional evaluation.

VECTRAN also provides functions for manipulating data
logically (PACK and UNPACK) and arithmetically (e.g., matrix
multiply).
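The distinction between the WHEN and AT statements described above, conditional assignment versus conditional evaluation, matters whenever a controlled expression is unsafe for some elements. A minimal Python sketch follows; the function names and the use of callables for the AT branches are assumptions for illustration, not VECTRAN syntax.

```python
def when(cond, then_values, else_values):
    """WHEN-style: both controlled expressions have already been fully
    evaluated for every element; select one result per element."""
    return [t if c else e
            for c, t, e in zip(cond, then_values, else_values)]


def at(cond, then_fn, else_fn, operands):
    """AT-style: evaluate, for each element, only the controlled
    expression that the condition selects."""
    return [then_fn(x) if c else else_fn(x)
            for c, x in zip(cond, operands)]


x = [4.0, 0.0, 2.0]
nonzero = [v != 0 for v in x]

# AT never evaluates 1.0 / 0.0, so the reciprocal is safe.
safe_reciprocal = at(nonzero, lambda v: 1.0 / v, lambda v: 0.0, x)

# The WHEN form would need [1.0 / v for v in x] computed for every
# element first, which fails on the zero element before any selection.
```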

VECTRAN has a number of desirable features. It is based
upon a well-known language (Fortran, albeit Fortran 66), it
provides for flexible manipulation of data, and it is targeted
toward numerical applications. Unfortunately, VECTRAN does not
remedy many of the problems in Fortran - poor control flow
constructs, the lack of user-defined data types with their
associated type checking, and Fortran's generally poor syntax.
Also, its indexing mechanisms, particularly vector indexing, are
complex to implement; this may seriously impact the efficiency of
an implementation on a parallel matrix processor.

C.2.3.9 Direct Parallelism Specification: Conclusions

A language which permits the direct specification of
parallelism, without requiring the specification of the low-level
implementation, is very attractive. However, none of the
languages described above (with the exception of Actus Plus,
whose full description was unknown at the time of the initial
language survey) is entirely suitable for implementation on a
parallel matrix processor such as the MPP. The languages are
either too vector-oriented or too general for efficient
implementation